Introduction

The objective of this notebook is to build an automated human activity recognition system. Main goal is to obtain highest cross-validated activity prediction performance by applying various data preprocessing and machine learning methods and tuning their parameters.

Labeled human acitivity used in this study is publicly available on Kaggle[1].

Throughout this workbook, I will follow an iterative process where I will go back and forward to apply various data visualization, data preprocessing and model-training methods while paying special attention on:

  • training time,
  • testing time,
  • prediction performance

My goal is, eventually, to learn more about the nature of the problem of activity recognition. I will mostly have an application developer view when I discuss the real life implications of the obtained results.

[1] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013 [ https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones ]

In [1]:
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames

class_labels = ['WALKING', 'WALKING_UPSTAIRS', 'WALKING_DOWNSTAIRS', 'SITTING', 'STANDING', 'LAYING']

X_train = pd.read_csv('train.csv')
s_train = X_train['subject']
X_train.drop('subject', axis = 1, inplace = True)
y_train = X_train['Activity'].to_frame().reset_index()
X_train.drop('Activity', axis = 1, inplace = True)
y_train = y_train.replace(class_labels, [0, 1, 2, 3, 4, 5])

X_test = pd.read_csv('test.csv')
s_test = X_test['subject']
X_test.drop('subject', axis = 1, inplace = True)
y_test = X_test['Activity'].to_frame().reset_index()
X_test.drop(['Activity'], axis = 1, inplace = True)
y_test = y_test.replace(class_labels, [0, 1, 2, 3, 4, 5])

#NOTE: this contenation method (viz. append) is safer than concat
X = X_train.append(X_test, ignore_index=True)
y = y_train.append(y_test, ignore_index=True)

display(X.describe())
# display(y.describe())

#NOTE: append can adjust the index value of the appended dataframe whereas concatenation cannot. Concatenation may
#result in duplicate indices.
# dataframes = [X_train, X_test]
# X = pd.concat(dataframes)
# dataframes = [y_train, y_test]
# y = pd.concat(dataframes)
tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z tBodyAcc-std()-X tBodyAcc-std()-Y tBodyAcc-std()-Z tBodyAcc-mad()-X tBodyAcc-mad()-Y tBodyAcc-mad()-Z tBodyAcc-max()-X ... fBodyBodyGyroJerkMag-meanFreq() fBodyBodyGyroJerkMag-skewness() fBodyBodyGyroJerkMag-kurtosis() angle(tBodyAccMean,gravity) angle(tBodyAccJerkMean),gravityMean) angle(tBodyGyroMean,gravityMean) angle(tBodyGyroJerkMean,gravityMean) angle(X,gravityMean) angle(Y,gravityMean) angle(Z,gravityMean)
count 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 ... 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000
mean 0.274347 -0.017743 -0.108925 -0.607784 -0.510191 -0.613064 -0.633593 -0.525697 -0.614989 -0.466732 ... 0.126708 -0.298592 -0.617700 0.007705 0.002648 0.017683 -0.009219 -0.496522 0.063255 -0.054284
std 0.067628 0.037128 0.053033 0.438694 0.500240 0.403657 0.413333 0.484201 0.399034 0.538707 ... 0.245443 0.320199 0.308796 0.336591 0.447364 0.616188 0.484770 0.511158 0.305468 0.268898
min -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
25% 0.262625 -0.024902 -0.121019 -0.992360 -0.976990 -0.979137 -0.993293 -0.977017 -0.979064 -0.935788 ... -0.019481 -0.536174 -0.841847 -0.124694 -0.287031 -0.493108 -0.389041 -0.817288 0.002151 -0.131880
50% 0.277174 -0.017162 -0.108596 -0.943030 -0.835032 -0.850773 -0.948244 -0.843670 -0.845068 -0.874825 ... 0.136245 -0.335160 -0.703402 0.008146 0.007668 0.017192 -0.007186 -0.715631 0.182028 -0.003882
75% 0.288354 -0.010625 -0.097589 -0.250293 -0.057336 -0.278737 -0.302033 -0.087405 -0.288149 -0.014641 ... 0.288960 -0.113167 -0.487981 0.149005 0.291490 0.536137 0.365996 -0.521503 0.250790 0.102970
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 561 columns

Iteration 1: Comparison of baseline classifiers

Before going into a more detailed work on the features and model training and testing, I will apply some of the supervised machine learning methods to have some idea about the baseline performance of those methods.

In [2]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn import cross_validation
from sklearn.metrics import precision_recall_fscore_support
import numpy as np
from time import time

from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings('ignore')

def train(clf, features, target):
    start = time()
    clf.fit(features, target)
    end = time()
    return end - start

def predict(clf, features):
    start = time()
    pred = clf.predict(features)
    end = time()
    return end - start, pred

clf_SGD = SGDClassifier(random_state = 42)
clf_Ada = AdaBoostClassifier(random_state = 42)
clf_DTR = DecisionTreeRegressor(random_state=42)
clf_KNC = KNeighborsClassifier()
clf_GNB = GaussianNB()
clf_SVM = SVC()

clfs = {clf_SGD, clf_Ada, clf_DTR, clf_KNC, clf_GNB, clf_SVM}

y_train_ = y_train['Activity']
y_test_ = y_test['Activity']
y_ = y['Activity']

for clf in clfs:
    printout = ""
    if clf == clf_SGD: printout = "SGD"
    elif clf == clf_Ada: printout = "Ada"
    elif clf == clf_DTR: printout = "DTR"
    elif clf == clf_KNC: printout = "KNC"
    elif clf == clf_GNB: printout = "GNB"
    elif clf == clf_SVM: printout = "SVM"

    results_precision = []
    results_recall = []
    results_fscore = []
    results_ttrain = []
    results_ttest = []
    kfold = cross_validation.KFold(X.shape[0], n_folds=10, shuffle=False, random_state=42)
    for train, test in kfold:
        start = time()
        clf.fit(X.iloc[train], y_[train])
        results_ttrain.append(time()-start)
        #NOTE: for some resason this line doesn't work.
#         t_train = train(clf, X.iloc[train], y_[train])
        t_test, y_pred = predict(clf, X.iloc[test])
        results_ttest.append(t_test)
        precision, recall, fscore, support = precision_recall_fscore_support(y_[test], y_pred, average='weighted')
        results_precision.append(precision)
        results_recall.append(recall)
        results_fscore.append(fscore)        
        
    printout += "  precision: {:.2f}".format(np.mean(results_precision))
    printout += "  recall: {:.2f}".format(np.mean(results_recall))
    printout += "  fscore: {:.2f}".format(np.mean(results_fscore))
    printout += "  t_train: {:.4f}sec".format(np.mean(results_ttrain))
    printout += "  t_pred: {:.4f}sec".format(np.mean(results_ttest))
    print printout
GNB  precision: 0.80  recall: 0.73  fscore: 0.72  t_train: 0.2316sec  t_pred: 0.0533sec
SVM  precision: 0.94  recall: 0.94  fscore: 0.93  t_train: 13.0880sec  t_pred: 3.1674sec
SGD  precision: 0.95  recall: 0.94  fscore: 0.94  t_train: 0.4460sec  t_pred: 0.0050sec
Ada  precision: 0.37  recall: 0.54  fscore: 0.41  t_train: 41.1263sec  t_pred: 0.0405sec
DTR  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 7.0560sec  t_pred: 0.0027sec
KNC  precision: 0.91  recall: 0.91  fscore: 0.91  t_train: 0.9203sec  t_pred: 11.7007sec

Iteration 2: Having a closer look at the features

SVM and SGD have the highest precision, recall and f1-score. SGD is the quickest in prediction and second quickest in training. I will be using SVM as the main method while discovering different feature processing methods and at the and I will compare SGD with SVM.

Although SVM's cross-validated classification performance is already very high (p=0.95, r=0.94, f=0.94), further investigation might still yield in even a higher classification performance. For instance, removing the outliers is one way to improve the model. We can't visualize a 561-dimensional space in a human readable form but we can still have a look at how the features are distributed individually. I will plot the distribution of some of the features below.

Moreover, some of the features might be redundant. Redundant features can easily be distinguished by investigating the correlation between them and the other features. If there is high correlation with other features, that means there is no reason to carry this feature as the information represented by this feature is already conveyed through other features.

Therefore, correlation matrix will allow us to see the distribution of the feature values individually and correlation between the features as shown below. I will use SelectKBest method to choose a subset of features for further investigation. In order to decide on the number K, I will run an exhaustive training batch where I change the number K and monitor che change in the cross-validated prediction performance of the SVM model.

In [3]:
from sklearn.feature_selection import SelectKBest
import matplotlib.pyplot as plt
%matplotlib inline

d_kbest_to_precision = {}
d_kbest_to_recall = {}
d_kbest_to_f1score = {}

kbest_max = X.shape[1]/5
clf = clf_SVM

for kbest in range (2, kbest_max):
    f_selector = SelectKBest(k=kbest)
    Xs = f_selector.fit(X, y_).transform(X)
    printout = "kbest: {:3d}".format(kbest)

    results_precision = []
    results_recall = []
    results_fscore = []
    results_ttrain = []
    results_ttest = []
    kfold = cross_validation.KFold(Xs.shape[0], n_folds=4, shuffle=False, random_state=42)
    for train, test in kfold:
        start = time()
        clf.fit(Xs[train], y_[train])
        results_ttrain.append(time()-start)
        #NOTE: for some resason this line doesn't work.
#         t_train = train(clf, X.iloc[train], y_[train])
        t_test, y_pred = predict(clf, Xs[test])
        results_ttest.append(t_test)
        precision, recall, fscore, support = precision_recall_fscore_support(y_[test], y_pred, average='weighted')
        results_precision.append(precision)
        results_recall.append(recall)
        results_fscore.append(fscore)        
        
    printout += "  precision: {:.2f}".format(np.mean(results_precision))
    printout += "  recall: {:.2f}".format(np.mean(results_recall))
    printout += "  fscore: {:.2f}".format(np.mean(results_fscore))
    printout += "  t_train: {:.3f}sec".format(np.mean(results_ttrain))
    printout += "  t_pred: {:.3f}sec".format(np.mean(results_ttest))
    print printout
    
    d_kbest_to_precision[kbest]=np.mean(results_precision)
    d_kbest_to_recall[kbest]=np.mean(results_recall)
    d_kbest_to_f1score[kbest]=np.mean(results_fscore)
kbest:   2  precision: 0.73  recall: 0.69  fscore: 0.67  t_train: 0.910sec  t_pred: 0.514sec
kbest:   3  precision: 0.73  recall: 0.70  fscore: 0.68  t_train: 0.920sec  t_pred: 0.534sec
kbest:   4  precision: 0.74  recall: 0.70  fscore: 0.69  t_train: 0.970sec  t_pred: 0.537sec
kbest:   5  precision: 0.73  recall: 0.70  fscore: 0.68  t_train: 0.987sec  t_pred: 0.559sec
kbest:   6  precision: 0.74  recall: 0.72  fscore: 0.70  t_train: 0.938sec  t_pred: 0.556sec
kbest:   7  precision: 0.74  recall: 0.72  fscore: 0.71  t_train: 0.979sec  t_pred: 0.587sec
kbest:   8  precision: 0.75  recall: 0.73  fscore: 0.71  t_train: 1.001sec  t_pred: 0.585sec
kbest:   9  precision: 0.76  recall: 0.73  fscore: 0.72  t_train: 1.002sec  t_pred: 0.600sec
kbest:  10  precision: 0.78  recall: 0.75  fscore: 0.74  t_train: 1.007sec  t_pred: 0.595sec
kbest:  11  precision: 0.81  recall: 0.79  fscore: 0.78  t_train: 0.863sec  t_pred: 0.520sec
kbest:  12  precision: 0.82  recall: 0.79  fscore: 0.77  t_train: 0.881sec  t_pred: 0.538sec
kbest:  13  precision: 0.82  recall: 0.80  fscore: 0.78  t_train: 0.885sec  t_pred: 0.544sec
kbest:  14  precision: 0.83  recall: 0.80  fscore: 0.78  t_train: 0.891sec  t_pred: 0.560sec
kbest:  15  precision: 0.83  recall: 0.80  fscore: 0.79  t_train: 0.899sec  t_pred: 0.583sec
kbest:  16  precision: 0.85  recall: 0.82  fscore: 0.80  t_train: 0.902sec  t_pred: 0.577sec
kbest:  17  precision: 0.85  recall: 0.82  fscore: 0.81  t_train: 0.919sec  t_pred: 0.601sec
kbest:  18  precision: 0.85  recall: 0.82  fscore: 0.80  t_train: 0.938sec  t_pred: 0.605sec
kbest:  19  precision: 0.85  recall: 0.82  fscore: 0.81  t_train: 0.958sec  t_pred: 0.617sec
kbest:  20  precision: 0.85  recall: 0.82  fscore: 0.81  t_train: 0.958sec  t_pred: 0.629sec
kbest:  21  precision: 0.85  recall: 0.82  fscore: 0.81  t_train: 0.985sec  t_pred: 0.644sec
kbest:  22  precision: 0.85  recall: 0.82  fscore: 0.81  t_train: 1.003sec  t_pred: 0.659sec
kbest:  23  precision: 0.85  recall: 0.82  fscore: 0.80  t_train: 1.029sec  t_pred: 0.677sec
kbest:  24  precision: 0.85  recall: 0.82  fscore: 0.80  t_train: 1.051sec  t_pred: 0.688sec
kbest:  25  precision: 0.85  recall: 0.82  fscore: 0.81  t_train: 1.071sec  t_pred: 0.721sec
kbest:  26  precision: 0.85  recall: 0.82  fscore: 0.81  t_train: 1.095sec  t_pred: 0.737sec
kbest:  27  precision: 0.85  recall: 0.82  fscore: 0.81  t_train: 1.105sec  t_pred: 0.740sec
kbest:  28  precision: 0.85  recall: 0.82  fscore: 0.81  t_train: 1.130sec  t_pred: 0.767sec
kbest:  29  precision: 0.85  recall: 0.82  fscore: 0.81  t_train: 1.150sec  t_pred: 0.771sec
kbest:  30  precision: 0.85  recall: 0.82  fscore: 0.81  t_train: 1.170sec  t_pred: 0.792sec
kbest:  31  precision: 0.85  recall: 0.83  fscore: 0.81  t_train: 1.185sec  t_pred: 0.793sec
kbest:  32  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.184sec  t_pred: 0.805sec
kbest:  33  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.197sec  t_pred: 0.815sec
kbest:  34  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.203sec  t_pred: 0.829sec
kbest:  35  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.228sec  t_pred: 0.845sec
kbest:  36  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.247sec  t_pred: 0.869sec
kbest:  37  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.240sec  t_pred: 0.863sec
kbest:  38  precision: 0.86  recall: 0.83  fscore: 0.82  t_train: 1.259sec  t_pred: 0.878sec
kbest:  39  precision: 0.86  recall: 0.83  fscore: 0.82  t_train: 1.266sec  t_pred: 0.888sec
kbest:  40  precision: 0.86  recall: 0.83  fscore: 0.82  t_train: 1.285sec  t_pred: 0.908sec
kbest:  41  precision: 0.86  recall: 0.83  fscore: 0.82  t_train: 1.316sec  t_pred: 0.920sec
kbest:  42  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.336sec  t_pred: 0.942sec
kbest:  43  precision: 0.86  recall: 0.83  fscore: 0.82  t_train: 1.344sec  t_pred: 0.963sec
kbest:  44  precision: 0.86  recall: 0.83  fscore: 0.82  t_train: 1.353sec  t_pred: 0.965sec
kbest:  45  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.376sec  t_pred: 0.967sec
kbest:  46  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.450sec  t_pred: 1.023sec
kbest:  47  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.432sec  t_pred: 1.007sec
kbest:  48  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.431sec  t_pred: 1.022sec
kbest:  49  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.455sec  t_pred: 1.036sec
kbest:  50  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.509sec  t_pred: 1.076sec
kbest:  51  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.508sec  t_pred: 1.083sec
kbest:  52  precision: 0.86  recall: 0.83  fscore: 0.82  t_train: 1.535sec  t_pred: 1.105sec
kbest:  53  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.522sec  t_pred: 1.096sec
kbest:  54  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.518sec  t_pred: 1.106sec
kbest:  55  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.558sec  t_pred: 1.130sec
kbest:  56  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.612sec  t_pred: 1.191sec
kbest:  57  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.649sec  t_pred: 1.181sec
kbest:  58  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.622sec  t_pred: 1.165sec
kbest:  59  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.622sec  t_pred: 1.174sec
kbest:  60  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.621sec  t_pred: 1.185sec
kbest:  61  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.640sec  t_pred: 1.196sec
kbest:  62  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.666sec  t_pred: 1.208sec
kbest:  63  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.674sec  t_pred: 1.227sec
kbest:  64  precision: 0.86  recall: 0.83  fscore: 0.81  t_train: 1.684sec  t_pred: 1.230sec
kbest:  65  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 1.655sec  t_pred: 1.205sec
kbest:  66  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 1.674sec  t_pred: 1.221sec
kbest:  67  precision: 0.87  recall: 0.87  fscore: 0.87  t_train: 1.698sec  t_pred: 1.239sec
kbest:  68  precision: 0.87  recall: 0.87  fscore: 0.87  t_train: 1.707sec  t_pred: 1.253sec
kbest:  69  precision: 0.87  recall: 0.87  fscore: 0.87  t_train: 1.728sec  t_pred: 1.262sec
kbest:  70  precision: 0.87  recall: 0.87  fscore: 0.87  t_train: 1.742sec  t_pred: 1.280sec
kbest:  71  precision: 0.87  recall: 0.87  fscore: 0.87  t_train: 1.754sec  t_pred: 1.290sec
kbest:  72  precision: 0.87  recall: 0.87  fscore: 0.87  t_train: 1.774sec  t_pred: 1.304sec
kbest:  73  precision: 0.87  recall: 0.87  fscore: 0.86  t_train: 1.791sec  t_pred: 1.327sec
kbest:  74  precision: 0.87  recall: 0.87  fscore: 0.86  t_train: 1.813sec  t_pred: 1.324sec
kbest:  75  precision: 0.87  recall: 0.86  fscore: 0.86  t_train: 1.833sec  t_pred: 1.341sec
kbest:  76  precision: 0.87  recall: 0.86  fscore: 0.86  t_train: 1.853sec  t_pred: 1.360sec
kbest:  77  precision: 0.87  recall: 0.86  fscore: 0.86  t_train: 1.861sec  t_pred: 1.371sec
kbest:  78  precision: 0.87  recall: 0.86  fscore: 0.86  t_train: 1.879sec  t_pred: 1.392sec
kbest:  79  precision: 0.87  recall: 0.87  fscore: 0.86  t_train: 1.910sec  t_pred: 1.418sec
kbest:  80  precision: 0.87  recall: 0.87  fscore: 0.86  t_train: 1.953sec  t_pred: 1.480sec
kbest:  81  precision: 0.87  recall: 0.87  fscore: 0.86  t_train: 1.958sec  t_pred: 1.458sec
kbest:  82  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 1.903sec  t_pred: 1.408sec
kbest:  83  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 1.940sec  t_pred: 1.423sec
kbest:  84  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 1.955sec  t_pred: 1.446sec
kbest:  85  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 1.966sec  t_pred: 1.452sec
kbest:  86  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 1.999sec  t_pred: 1.483sec
kbest:  87  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 2.013sec  t_pred: 1.485sec
kbest:  88  precision: 0.88  recall: 0.88  fscore: 0.87  t_train: 2.030sec  t_pred: 1.493sec
kbest:  89  precision: 0.88  recall: 0.88  fscore: 0.87  t_train: 2.046sec  t_pred: 1.498sec
kbest:  90  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 2.062sec  t_pred: 1.531sec
kbest:  91  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 2.089sec  t_pred: 1.534sec
kbest:  92  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 2.099sec  t_pred: 1.557sec
kbest:  93  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 2.125sec  t_pred: 1.569sec
kbest:  94  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 2.146sec  t_pred: 1.583sec
kbest:  95  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 2.159sec  t_pred: 1.599sec
kbest:  96  precision: 0.88  recall: 0.88  fscore: 0.87  t_train: 2.186sec  t_pred: 1.613sec
kbest:  97  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 2.212sec  t_pred: 1.632sec
kbest:  98  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 2.223sec  t_pred: 1.648sec
kbest:  99  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 2.254sec  t_pred: 1.663sec
kbest: 100  precision: 0.88  recall: 0.88  fscore: 0.88  t_train: 2.254sec  t_pred: 1.671sec
kbest: 101  precision: 0.88  recall: 0.88  fscore: 0.88  t_train: 2.269sec  t_pred: 1.678sec
kbest: 102  precision: 0.88  recall: 0.88  fscore: 0.88  t_train: 2.283sec  t_pred: 1.690sec
kbest: 103  precision: 0.88  recall: 0.88  fscore: 0.88  t_train: 2.300sec  t_pred: 1.707sec
kbest: 104  precision: 0.88  recall: 0.88  fscore: 0.88  t_train: 2.373sec  t_pred: 1.791sec
kbest: 105  precision: 0.88  recall: 0.88  fscore: 0.88  t_train: 2.363sec  t_pred: 1.752sec
kbest: 106  precision: 0.88  recall: 0.88  fscore: 0.88  t_train: 2.368sec  t_pred: 1.756sec
kbest: 107  precision: 0.88  recall: 0.88  fscore: 0.88  t_train: 2.375sec  t_pred: 1.771sec
kbest: 108  precision: 0.88  recall: 0.88  fscore: 0.88  t_train: 2.409sec  t_pred: 1.793sec
kbest: 109  precision: 0.88  recall: 0.88  fscore: 0.88  t_train: 2.427sec  t_pred: 1.808sec
kbest: 110  precision: 0.88  recall: 0.88  fscore: 0.87  t_train: 2.435sec  t_pred: 1.818sec
kbest: 111  precision: 0.88  recall: 0.88  fscore: 0.87  t_train: 2.464sec  t_pred: 1.835sec
In [7]:
plt.rcParams['figure.figsize'] = (20.0, 10.0)
plt.grid(True)
major_ticks = np.arange(0, kbest_max, 20) 
minor_ticks = np.arange(0, kbest_max, 5)

# ax.set_xticks(major_ticks)                                                       
# ax.set_xticks(minor_ticks, minor=True) 
plt.xticks(minor_ticks)
plt.plot(d_kbest_to_precision.keys(), d_kbest_to_precision.values(), 'r',
        d_kbest_to_recall.keys(), d_kbest_to_recall.values(), 'g',
        d_kbest_to_f1score.keys(), d_kbest_to_f1score.values(), 'b')
plt.show()

Precision, recall and fscore values are calculated based on the class weighted average as there is imbalance between the number of class labels in the dataset
496 WALKING,
471 WALKING_UPSTAIRS,
420 WALKING_DOWNSTAIRS,
491 SITTING,
532 STANDING,
537 LAYING

I will consider the best 16 features to further investigate. This is where the classification scores peak for the first time, and it doesn't change that much after that point on.[NOTE: ADD LEGEND TO THE PLOT]

In [4]:
kbest_selected = 16
f_selector = SelectKBest(k=kbest_selected)
f_selector.fit(X, y['Activity'])
f_selected_indices = f_selector.get_support(indices=False)
Xs_cols = X.columns[f_selected_indices]
Xs = X[Xs_cols] # dataset with selected features
# display(Xs.describe())

Having normally distributed features is the fundamental assumption in many predictive models. Normal distribution is un-skewed. It means the probability of falling in the right or left side of the mean is equally likely. As we can see from the correlation matrix above, and the skewness test below, these features are quite skewed, even mostly bimodal. [NOTE: FURTHER DISCUSSION IS NEEDED]

In [5]:
import scipy.stats.stats as st
import operator

skness = st.skew(X)

d_feature2skew = {}
for skew, feature in zip(skness , X.columns.values.tolist()):
    d_feature2skew[feature]=skew
    
feature2skew = sorted(d_feature2skew.items(), key=operator.itemgetter(1), reverse=True)
for key, value in feature2skew:
    print str(value) + " " + str(key)
14.0274421291 fBodyAccJerk-bandsEnergy()-57,64
12.3582190015 fBodyGyro-bandsEnergy()-33,40.1
12.3405253118 tGravityAcc-iqr()-X
11.4406066343 tGravityAcc-mad()-X
11.1342039411 tGravityAcc-std()-X
10.2724333963 fBodyGyro-bandsEnergy()-33,48.1
9.46984761412 tGravityAcc-iqr()-Y
9.15793962521 fBodyGyro-bandsEnergy()-33,40.2
8.9289275525 tGravityAcc-mad()-Y
8.78163491799 tGravityAcc-std()-Y
8.51410725858 fBodyAccJerk-bandsEnergy()-57,64.1
8.23969565046 fBodyGyro-bandsEnergy()-57,64.1
8.21504807324 fBodyGyro-bandsEnergy()-33,48.2
8.15162096692 fBodyAccJerk-bandsEnergy()-57,64.2
8.0421632488 fBodyGyro-bandsEnergy()-57,64
7.89396800077 fBodyGyro-bandsEnergy()-25,48.1
7.81368318394 fBodyGyro-bandsEnergy()-57,64.2
7.8117806773 fBodyAccJerk-bandsEnergy()-33,40.2
7.49245465563 fBodyGyro-bandsEnergy()-25,32.1
7.43064576666 fBodyGyro-bandsEnergy()-25,48.2
7.37033016756 fBodyGyro-bandsEnergy()-49,64
7.30307814418 fBodyAcc-bandsEnergy()-33,40.2
7.2757680758 tGravityAcc-iqr()-Z
7.25359421168 fBodyGyro-bandsEnergy()-17,24.1
7.02811277083 fBodyGyro-bandsEnergy()-25,32.2
7.01664123817 tGravityAcc-mad()-Z
6.92781847882 tGravityAcc-std()-Z
6.79014465903 fBodyGyro-bandsEnergy()-49,56
6.35607108151 fBodyGyro-bandsEnergy()-41,48.1
6.3519826935 fBodyGyro-bandsEnergy()-49,64.2
6.34333429702 fBodyGyro-bandsEnergy()-9,16.1
6.33657460356 fBodyGyro-bandsEnergy()-17,32.1
6.29905585908 fBodyGyro-bandsEnergy()-41,48.2
6.18683017123 fBodyGyro-bandsEnergy()-49,64.1
6.11414792659 fBodyAccJerk-bandsEnergy()-33,48.2
6.08827213752 fBodyAccJerk-bandsEnergy()-25,32.2
6.02475857025 fBodyAccJerk-bandsEnergy()-25,48.2
6.02229862855 fBodyGyro-bandsEnergy()-49,56.1
5.96164826714 fBodyGyro-bandsEnergy()-25,32
5.96009248698 fBodyGyro-bandsEnergy()-33,40
5.93584433775 fBodyAcc-bandsEnergy()-57,64.2
5.91588430745 fBodyAcc-bandsEnergy()-25,32.2
5.87487460185 fBodyAcc-bandsEnergy()-33,48.2
5.79595156173 fBodyAcc-bandsEnergy()-25,48.2
5.70075950591 fBodyAcc-bandsEnergy()-57,64.1
5.57328329589 fBodyGyro-bandsEnergy()-25,48
5.39734500466 fBodyGyro-bandsEnergy()-49,56.2
5.28556934279 fBodyBodyGyroJerkMag-maxInds
5.20335604517 fBodyGyro-bandsEnergy()-33,48
5.19471100405 fBodyGyro-bandsEnergy()-41,48
5.01469251886 fBodyBodyAccJerkMag-maxInds
4.98416905982 fBodyAcc-bandsEnergy()-57,64
4.889908241 tBodyGyroJerk-energy()-Y
4.75113147262 fBodyGyro-bandsEnergy()-9,16.2
4.676335405 fBodyBodyGyroJerkMag-energy()
4.62452087934 fBodyAcc-bandsEnergy()-49,64.2
4.59780071293 fBodyAccJerk-bandsEnergy()-41,48.2
4.51718190618 fBodyAcc-bandsEnergy()-41,48.2
4.44474574359 fBodyAcc-bandsEnergy()-49,64
4.25988077353 fBodyGyro-bandsEnergy()-17,24.2
4.2297130126 fBodyAcc-bandsEnergy()-49,56.2
4.1858605803 fBodyAccJerk-bandsEnergy()-17,32.2
4.09906618365 fBodyAcc-bandsEnergy()-49,56
4.06162215152 fBodyAcc-bandsEnergy()-49,64.1
4.04507460345 fBodyGyro-bandsEnergy()-17,32.2
3.99292463278 fBodyAccJerk-bandsEnergy()-49,56.2
3.99177784967 fBodyAccJerk-bandsEnergy()-49,64.2
3.93546285531 fBodyAcc-bandsEnergy()-17,24.2
3.93463660304 fBodyAccJerk-bandsEnergy()-49,64
3.92920919826 fBodyAccJerk-bandsEnergy()-49,56
3.91929238835 fBodyAcc-bandsEnergy()-17,32.2
3.8231253941 tBodyGyroJerkMag-energy()
3.80409109096 fBodyAccJerk-bandsEnergy()-17,24.2
3.75557639651 fBodyGyro-min()-X
3.66852703812 fBodyGyro-bandsEnergy()-1,8.2
3.32832743672 fBodyAccJerk-bandsEnergy()-33,40.1
3.31674074595 fBodyAcc-bandsEnergy()-49,56.1
3.30127703451 fBodyAccJerk-energy()-Z
3.3003295991 tBodyAccJerk-energy()-Z
3.2601938792 fBodyGyro-bandsEnergy()-1,8
3.24112637398 tBodyGyroJerk-energy()-Z
3.16108569892 fBodyGyro-bandsEnergy()-17,32
3.15860865885 fBodyGyro-bandsEnergy()-1,16.2
3.15695398264 fBodyGyro-bandsEnergy()-1,8.1
3.14492920624 fBodyAcc-bandsEnergy()-33,40.1
3.13299368395 fBodyAcc-bandsEnergy()-41,48
3.13165825465 fBodyGyro-min()-Z
3.09306441279 fBodyAccJerk-kurtosis()-Y
3.08181367606 fBodyAcc-min()-Z
3.07861497393 fBodyAccJerk-bandsEnergy()-49,64.1
3.06959234778 fBodyGyro-min()-Y
3.06142206628 fBodyAccJerk-bandsEnergy()-25,32.1
3.05103206797 tBodyGyroJerk-energy()-X
3.02815577794 fBodyGyro-bandsEnergy()-17,24
3.0202157803 fBodyAcc-bandsEnergy()-25,32.1
3.00879643509 fBodyAccJerk-bandsEnergy()-49,56.1
2.99641078034 fBodyAccJerk-bandsEnergy()-9,16.2
2.98624322256 fBodyGyro-bandsEnergy()-9,16
2.95716083346 tBodyGyro-energy()-Z
2.95395782692 fBodyGyro-bandsEnergy()-1,16.1
2.9410860016 fBodyAcc-bandsEnergy()-9,16.2
2.92033989182 fBodyAccJerk-bandsEnergy()-1,8.2
2.91764446591 fBodyGyro-bandsEnergy()-1,24.2
2.89967914491 fBodyGyro-bandsEnergy()-1,16
2.8924360657 fBodyAccJerk-kurtosis()-Z
2.88808169999 fBodyBodyGyroMag-min()
2.8716980296 fBodyAccJerk-bandsEnergy()-25,32
2.8708380889 fBodyAccJerk-bandsEnergy()-41,48
2.83905120874 fBodyGyro-energy()-Z
2.83090612019 fBodyAccJerk-bandsEnergy()-33,40
2.82740155386 fBodyGyro-bandsEnergy()-1,24
2.81588617031 fBodyAccMag-min()
2.80447415043 fBodyGyro-energy()-X
2.79823345941 fBodyGyro-energy()-Y
2.79003247183 fBodyGyro-bandsEnergy()-1,24.1
2.78257316546 fBodyAccJerk-bandsEnergy()-1,24.2
2.7706487508 fBodyAcc-bandsEnergy()-25,32
2.76255744324 tBodyGyro-energy()-Y
2.75066169455 fBodyAccJerk-min()-Z
2.74009905589 fBodyAcc-bandsEnergy()-33,40
2.7385321743 fBodyAcc-min()-Y
2.71054665393 fBodyGyro-maxInds-X
2.67372731772 fBodyAcc-bandsEnergy()-33,48.1
2.66225442743 fBodyAccJerk-bandsEnergy()-33,48.1
2.65260703339 fBodyAcc-bandsEnergy()-41,48.1
2.6335532331 fBodyAcc-bandsEnergy()-33,48
2.63053344183 fBodyAcc-bandsEnergy()-25,48.1
2.62423588132 fBodyAccJerk-bandsEnergy()-41,48.1
2.62077556636 fBodyAcc-maxInds-Z
2.61589403969 fBodyAcc-bandsEnergy()-1,8.2
2.61566522733 fBodyAccJerk-bandsEnergy()-25,48.1
2.60946390506 fBodyAccJerk-bandsEnergy()-1,16.2
2.59256320402 tGravityAcc-entropy()-Y
2.59011669381 fBodyBodyGyroMag-energy()
2.54422208703 fBodyAccJerk-bandsEnergy()-33,48
2.50462343685 fBodyAcc-min()-X
2.49067063891 fBodyAccJerk-min()-X
2.45866310095 fBodyAccJerk-bandsEnergy()-1,8
2.44758886656 fBodyAcc-bandsEnergy()-25,48
2.42840475392 fBodyAccJerk-bandsEnergy()-25,48
2.40285801332 fBodyBodyGyroMag-maxInds
2.37346365479 fBodyAccJerk-min()-Y
2.36126874056 tGravityAcc-energy()-Z
2.3568462525 tBodyGyro-energy()-X
2.35645334898 fBodyBodyGyroJerkMag-min()
2.3455219318 fBodyAccJerk-bandsEnergy()-9,16.1
2.34514812817 fBodyAcc-bandsEnergy()-1,16.2
2.336852093 fBodyAcc-bandsEnergy()-9,16
2.31396289398 fBodyAccJerk-bandsEnergy()-17,24
2.26633907716 fBodyAcc-bandsEnergy()-17,24
2.24978653007 fBodyAccJerk-bandsEnergy()-9,16
2.24701526088 tGravityAcc-energy()-Y
2.24474898421 fBodyAcc-bandsEnergy()-9,16.1
2.22358605035 tBodyAcc-energy()-Z
2.18854227131 fBodyAccJerk-bandsEnergy()-17,32
2.14868233163 fBodyAcc-bandsEnergy()-1,24.2
2.13276463034 fBodyAcc-bandsEnergy()-17,32
2.10651790552 fBodyAcc-energy()-Z
2.07793048557 fBodyGyro-maxInds-Y
2.07009971894 fBodyAcc-bandsEnergy()-17,24.1
2.045096371 fBodyAccJerk-bandsEnergy()-17,24.1
2.03407449851 fBodyAccJerk-bandsEnergy()-17,32.1
2.01684782491 fBodyAccJerk-bandsEnergy()-1,16
1.99964199588 fBodyAccJerk-bandsEnergy()-1,16.1
1.9882925733 fBodyAcc-bandsEnergy()-17,32.1
1.95792153439 fBodyAccJerk-kurtosis()-X
1.94933817053 fBodyAcc-maxInds-X
1.93151286553 tBodyGyroJerk-max()-Y
1.91887830887 tBodyAcc-energy()-Y
1.90848665811 fBodyAccJerk-bandsEnergy()-1,8.1
1.90569071283 fBodyAcc-bandsEnergy()-1,8
1.89850253278 fBodyAccMag-kurtosis()
1.8433187822 fBodyBodyAccJerkMag-min()
1.80925018427 fBodyAcc-bandsEnergy()-1,16
1.80650371815 fBodyGyro-maxInds-Z
1.79530053652 fBodyAccJerk-energy()-X
1.79493081519 tBodyAccJerk-energy()-X
1.77582077434 fBodyAccJerk-bandsEnergy()-1,24
1.76991705001 fBodyBodyAccJerkMag-energy()
1.75939151981 tBodyAcc-mean()-Z
1.74743273066 fBodyAcc-bandsEnergy()-1,24
1.7366745789 fBodyAcc-energy()-X
1.7348920256 tBodyAcc-energy()-X
1.73477173926 fBodyAcc-maxInds-Y
1.7089686733 tBodyGyroMag-energy()
1.69826265506 fBodyAccJerk-bandsEnergy()-1,24.1
1.66158188414 tBodyAccJerkMag-energy()
1.65833564923 fBodyAccMag-energy()
1.65279262445 fBodyBodyGyroJerkMag-max()
1.62957406303 fBodyAccJerk-energy()-Y
1.6291553282 tBodyAccJerk-energy()-Y
1.62234947919 fBodyAcc-kurtosis()-Y
1.61939135517 fBodyBodyGyroJerkMag-std()
1.60008238638 tBodyGyroJerk-std()-Y
1.59115715938 tBodyGyroJerkMag-max()
1.56850976571 fBodyBodyGyroJerkMag-mad()
1.56789558629 fBodyGyro-max()-Y
1.53079497022 fBodyBodyGyroJerkMag-kurtosis()
1.52215074375 tBodyGyroJerk-mad()-Y
1.5159992198 tBodyAccJerk-max()-Z
1.49168954209 fBodyBodyGyroJerkMag-iqr()
1.46430027007 tBodyGyro-max()-Y
1.46409820924 tBodyGyroJerk-iqr()-Y
1.46387104576 tBodyGyroJerkMag-std()
1.43097916293 fBodyGyro-iqr()-Y
1.42296186491 angle(X,gravityMean)
1.42293058033 fBodyBodyGyroMag-kurtosis()
1.39799851089 fBodyAccJerk-max()-Z
1.3977815421 fBodyAcc-bandsEnergy()-1,8.1
1.39776607041 fBodyBodyGyroJerkMag-mean()
1.39776607041 fBodyBodyGyroJerkMag-sma()
1.3755465212 tBodyGyroJerkMag-min()
1.36135059225 fBodyGyro-max()-Z
1.35945450834 tBodyGyroJerkMag-mad()
1.31261622275 fBodyAcc-bandsEnergy()-1,16.1
1.29581582098 fBodyGyro-kurtosis()-Y
1.29213257034 fBodyBodyAccJerkMag-kurtosis()
1.27634939598 fBodyAcc-bandsEnergy()-1,24.1
1.26181895937 fBodyAcc-energy()-Y
1.23968057574 tBodyGyroJerk-max()-Z
1.23073733675 tBodyGyroJerkMag-iqr()
1.2170597452 fBodyAccMag-maxInds
1.21398463721 tBodyAccJerkMag-min()
1.20000369521 tGravityAccMag-energy()
1.20000369521 tBodyAccMag-energy()
1.18918166351 tBodyGyroMag-min()
1.17353661325 tGravityAcc-min()-Y
1.16693831205 tGravityAcc-mean()-Y
1.16535896438 fBodyAccMag-skewness()
1.16513795296 fBodyAccJerk-iqr()-Z
1.15873029322 tGravityAcc-max()-Y
1.14593048845 fBodyGyro-kurtosis()-Z
1.10942102955 fBodyAcc-skewness()-Y
1.10860332085 fBodyAccJerk-std()-Z
1.10115843382 fBodyGyro-max()-X
1.09835986696 fBodyAccJerk-skewness()-Z
1.09433852001 tBodyGyro-iqr()-Z
1.08825818362 fBodyAcc-iqr()-Z
1.08821253386 tGravityAccMag-min()
1.08821253386 tBodyAccMag-min()
1.08662739876 fBodyBodyGyroMag-max()
1.07952031179 fBodyGyro-kurtosis()-X
1.07931144188 tBodyGyroJerk-max()-X
1.07665757147 tBodyAccJerk-iqr()-Z
1.07518663196 fBodyGyro-std()-Y
1.07218104917 fBodyGyro-mean()-Y
1.07197804091 tBodyAccJerk-max()-X
1.06901621026 fBodyAccJerk-mad()-Z
1.05567901482 tGravityAcc-entropy()-Z
1.0501379703 tBodyAccJerk-std()-Z
1.04887094126 tBodyAccJerk-max()-Y
1.04219838415 fBodyGyro-mad()-Y
1.0405543979 fBodyAccJerk-mean()-Z
1.03199153378 tBodyAccJerk-mad()-Z
1.02955007896 tBodyGyro-iqr()-Y
1.01349820481 fBodyGyro-iqr()-Z
1.00883017514 fBodyAcc-kurtosis()-Z
1.00743628222 tBodyGyro-max()-X
0.999404578414 tBodyGyro-std()-Y
0.998595950474 fBodyBodyGyroMag-iqr()
0.986389047812 tBodyGyroJerkMag-sma()
0.986389047812 tBodyGyroJerkMag-mean()
0.981398684588 fBodyAccJerk-skewness()-Y
0.976344174031 tBodyGyro-mad()-Y
0.974333508872 fBodyGyro-iqr()-X
0.95958198408 tBodyGyroJerk-std()-Z
0.956959816952 fBodyAcc-max()-Z
0.932021451793 tBodyGyroJerk-sma()
0.921737927362 fBodyAcc-kurtosis()-X
0.899971132243 tBodyGyro-max()-Z
0.897149099437 fBodyBodyGyroMag-sma()
0.897149099437 fBodyBodyGyroMag-mean()
0.896579765667 fBodyBodyGyroMag-std()
0.894074155545 tBodyGyroJerk-iqr()-X
0.891713039859 fBodyGyro-std()-Z
0.887792803809 tBodyGyroJerk-iqr()-Z
0.885977202538 tBodyGyroJerk-mad()-Z
0.885562703711 fBodyBodyAccJerkMag-max()
0.878483286913 tBodyGyroJerk-std()-X
0.876383470075 fBodyAcc-max()-X
0.868902058375 fBodyGyro-std()-X
0.865735592061 tGravityAcc-entropy()-X
0.864427916953 tBodyGyro-iqr()-X
0.863258341578 tBodyGyroMag-max()
0.859952417894 tBodyGyroJerk-mad()-X
0.855588638168 fBodyAccJerk-max()-Y
0.854968192851 fBodyAccMag-iqr()
0.853000598265 fBodyAccMag-max()
0.852631117761 tBodyAcc-iqr()-X
0.842219193579 tBodyGyro-mad()-Z
0.839751900174 fBodyAccJerk-max()-X
0.819382972131 fBodyBodyGyroMag-mad()
0.810580413835 tBodyGyroMag-std()
0.810395438671 tBodyAcc-max()-Z
0.807857692993 fBodyAccJerk-iqr()-X
0.803066493061 fBodyAcc-iqr()-X
0.80030957345 tBodyGyro-mad()-X
0.795330868908 tBodyGyro-std()-X
0.793890804483 fBodyAccJerk-skewness()-X
0.787113694259 fBodyBodyGyroMag-skewness()
0.7778882466 tBodyGyro-std()-Z
0.768802271761 fBodyBodyAccJerkMag-iqr()
0.767551861107 tBodyGyroMag-iqr()
0.759804971223 tBodyGyroMag-mad()
0.742596813893 fBodyAccJerk-iqr()-Y
0.740168755755 fBodyAcc-iqr()-Y
0.739820217699 fBodyGyro-mean()-X
0.738337523911 fBodyGyro-skewness()-Y
0.736326449457 fBodyAcc-mean()-Z
0.734125201145 tBodyAccMag-iqr()
0.734125201145 tGravityAccMag-iqr()
0.731562470842 fBodyGyro-mad()-X
0.72519230022 tBodyAccJerkMag-iqr()
0.725043436567 fBodyBodyAccJerkMag-std()
0.719182210998 tBodyAccJerk-iqr()-X
0.717970781452 tGravityAcc-min()-Z
0.714933578877 tGravityAcc-mean()-Z
0.714516797617 fBodyAcc-std()-Z
0.710484576166 tGravityAcc-max()-Z
0.709445199772 fBodyGyro-mad()-Z
0.706242466919 tBodyAccJerkMag-max()
0.699368882229 fBodyGyro-mean()-Z
0.698429267795 fBodyAcc-mad()-Z
0.696713240294 fBodyBodyAccJerkMag-mad()
0.692754520232 fBodyBodyGyroJerkMag-skewness()
0.689904357473 fBodyAccJerk-mean()-X
0.686560508799 fBodyAccJerk-std()-Y
0.684484457536 tBodyAcc-mad()-X
0.679540466169 fBodyBodyAccJerkMag-mean()
0.679540466169 fBodyBodyAccJerkMag-sma()
0.678208347527 tBodyAccJerkMag-mad()
0.676185698675 fBodyAcc-std()-X
0.675303264928 fBodyAccJerk-std()-X
0.675106725061 tBodyAccJerkMag-std()
0.673583783075 fBodyAccJerk-mad()-X
0.669183198287 fBodyAccJerk-mad()-Y
0.665097401768 tBodyAccJerk-std()-X
0.661803542963 tBodyAccJerk-mad()-X
0.661588792698 tBodyAccJerk-iqr()-Y
0.66093300966 tBodyAcc-std()-Z
0.660807503434 fBodyAccMag-mad()
0.653564973792 fBodyGyro-sma()
0.65252508798 fBodyAccMag-mean()
0.65252508798 fBodyAccMag-sma()
0.651507095393 fBodyAccMag-std()
0.651217278251 tBodyAcc-iqr()-Z
0.642669231697 tBodyAccJerk-std()-Y
0.641100176633 fBodyAcc-max()-Y
0.63692184831 tBodyAcc-std()-X
0.634672943249 fBodyAccJerk-mean()-Y
0.627176518078 tBodyAcc-mad()-Z
0.623552618043 tGravityAccMag-mad()
0.623552618043 tBodyAccMag-mad()
0.622762775737 fBodyAccJerk-sma()
0.617651845516 fBodyAcc-mean()-X
0.617282993415 tBodyAccMag-std()
0.617282993415 tGravityAccMag-std()
0.614430566037 fBodyBodyAccJerkMag-skewness()
0.612580907669 fBodyGyro-skewness()-Z
0.611575881537 tBodyAccJerk-mad()-Y
0.608103538332 fBodyAcc-mad()-X
0.601852530578 tBodyAcc-max()-X
0.60133410791 tBodyAccJerk-sma()
0.593972814384 tBodyAccJerkMag-mean()
0.593972814384 tBodyAccJerkMag-sma()
0.588132924519 fBodyAcc-skewness()-Z
0.583765832366 tBodyAcc-max()-Y
0.564614326445 tBodyAcc-correlation()-X,Y
0.557655266106 tBodyAccMag-max()
0.557655266106 tGravityAccMag-max()
0.541058634127 tBodyAcc-iqr()-Y
0.516584338001 tBodyGyroMag-sma()
0.516584338001 tBodyGyroMag-mean()
0.51320896886 tBodyGyro-sma()
0.508417501483 fBodyGyro-skewness()-X
0.508043011358 tGravityAcc-arCoeff()-X,3
0.493217019657 fBodyAcc-mean()-Y
0.477907809184 fBodyAccJerk-maxInds-Y
0.477263627038 fBodyAcc-mad()-Y
0.469180783726 fBodyAcc-sma()
0.453971605378 fBodyAcc-skewness()-X
0.443094366228 fBodyAcc-std()-Y
0.436261160742 tBodyAcc-mad()-Y
0.435232982925 tBodyAcc-std()-Y
0.417992043488 tBodyAcc-arCoeff()-Z,2
0.408332492233 tBodyAccMag-mean()
0.408332492233 tGravityAccMag-mean()
0.408332492233 tGravityAccMag-sma()
0.408332492233 tBodyAccMag-sma()
0.396432265548 tBodyAccJerkMag-arCoeff()2
0.389602855112 tBodyAcc-sma()
0.343362509718 tBodyAcc-arCoeff()-X,2
0.329196303405 tGravityAcc-arCoeff()-X,1
0.322390135742 tBodyGyroJerkMag-arCoeff()4
0.289900529157 tBodyGyroJerkMag-arCoeff()2
0.273219057633 tBodyAcc-arCoeff()-Y,2
0.269810028483 fBodyAccJerk-entropy()-Z
0.268717083068 tGravityAcc-sma()
0.268626982874 tGravityAcc-arCoeff()-Z,3
0.261247012195 tBodyGyro-arCoeff()-X,1
0.244053232175 tGravityAcc-arCoeff()-Y,3
0.231408007738 tBodyGyroJerk-arCoeff()-Y,1
0.225381717128 fBodyBodyAccJerkMag-entropy()
0.222750998528 tBodyGyro-correlation()-X,Y
0.221751063028 tBodyGyroJerk-arCoeff()-X,1
0.214785397225 tGravityAcc-correlation()-X,Z
0.214095214436 tBodyGyro-entropy()-X
0.208069769381 tBodyAcc-correlation()-X,Z
0.200122462248 fBodyAccJerk-entropy()-X
0.194580580474 tBodyGyroMag-arCoeff()2
0.193893906745 tGravityAcc-arCoeff()-Z,1
0.193645469357 tBodyGyro-mean()-Y
0.193019223009 fBodyAccJerk-entropy()-Y
0.188543153359 fBodyAccJerk-maxInds-X
0.180856348488 tBodyGyroJerk-arCoeff()-Z,1
0.170575928287 tBodyGyro-correlation()-X,Z
0.16913457338 tBodyGyro-arCoeff()-Z,1
0.165858145244 tBodyAccJerk-entropy()-Z
0.164544722854 tBodyGyro-arCoeff()-Y,1
0.158932692684 tBodyGyroJerk-correlation()-Y,Z
0.154812849149 fBodyAccJerk-meanFreq()-Y
0.154195552001 tBodyAccMag-arCoeff()2
0.154195552001 tGravityAccMag-arCoeff()2
0.147160728856 fBodyAcc-meanFreq()-X
0.139300306473 tBodyGyroJerk-mean()-X
0.136991541129 fBodyAcc-entropy()-Z
0.125307722428 fBodyAcc-entropy()-X
0.121445179385 fBodyBodyGyroJerkMag-entropy()
0.118747530895 tBodyAccJerkMag-arCoeff()3
0.118471361957 tBodyAcc-arCoeff()-Y,1
0.112677206662 tBodyAccJerk-mean()-X
0.10593815062 tBodyAccJerk-entropy()-X
0.101139651259 tBodyAccJerk-correlation()-X,Y
0.099979327303 tGravityAcc-arCoeff()-Y,1
0.0991962249851 tBodyAcc-arCoeff()-X,1
0.0953001247937 tBodyAccJerkMag-arCoeff()4
0.0939906319389 tBodyGyro-arCoeff()-X,2
0.0921940628055 fBodyAcc-meanFreq()-Y
0.0907095754366 fBodyBodyGyroMag-meanFreq()
0.0853114441915 fBodyAccMag-entropy()
0.0827741497792 tBodyAccJerkMag-entropy()
0.0760363198177 tBodyGyroJerk-correlation()-X,Z
0.0741980322368 tBodyAccJerk-entropy()-Y
0.0710436767198 tBodyGyroMag-arCoeff()1
0.0700384546029 fBodyAcc-entropy()-Y
0.0646220364334 tBodyAccJerk-arCoeff()-Y,1
0.0632301422719 tBodyGyroJerk-arCoeff()-Z,2
0.0630934868348 tBodyGyro-arCoeff()-Z,2
0.0580587212785 fBodyGyro-meanFreq()-Z
0.0557854012593 tBodyAccMag-arCoeff()1
0.0557854012593 tGravityAccMag-arCoeff()1
0.0519532219898 tBodyGyroJerk-mean()-Z
0.0476021732207 tBodyGyroJerk-mean()-Y
0.0347704422585 fBodyAccJerk-maxInds-Z
0.0339294282868 fBodyAccMag-meanFreq()
0.0328805491166 tBodyGyroJerk-entropy()-Z
0.0327374255746 tBodyGyroJerkMag-arCoeff()3
0.0267752303541 tBodyAcc-arCoeff()-Z,4
0.0240181996008 tBodyAccJerk-arCoeff()-X,1
0.0237067101828 tBodyGyro-arCoeff()-Y,2
0.0184313807779 tBodyGyro-arCoeff()-X,4
0.0174189176525 angle(tBodyGyroJerkMean,gravityMean)
0.0151974117498 fBodyGyro-meanFreq()-Y
0.0141803927318 tBodyAccJerk-arCoeff()-Z,2
0.0140510581956 tBodyGyro-correlation()-Y,Z
0.0121648766824 fBodyGyro-entropy()-Z
0.00992216217266 tBodyGyroJerk-correlation()-X,Y
0.00215377964944 fBodyGyro-meanFreq()-X
-0.00414219174709 tBodyGyroJerk-arCoeff()-Y,2
-0.00706628542013 tBodyGyro-arCoeff()-Z,4
-0.0132529765856 tBodyAccJerk-mean()-Y
-0.0135294982223 fBodyGyro-entropy()-Y
-0.0190788744894 angle(tBodyAccJerkMean),gravityMean)
-0.0202594286111 tBodyAcc-entropy()-Z
-0.0292376915742 fBodyAccJerk-meanFreq()-X
-0.0313683663163 fBodyBodyAccJerkMag-meanFreq()
-0.0324442156276 tBodyAccJerk-arCoeff()-Y,2
-0.0341555811287 angle(tBodyGyroMean,gravityMean)
-0.0366040420095 tBodyGyroJerk-arCoeff()-X,2
-0.0367363850759 tBodyAccMag-arCoeff()4
-0.0367363850759 tGravityAccMag-arCoeff()4
-0.0371594517177 tBodyAccJerk-correlation()-Y,Z
-0.0433351711594 fBodyBodyGyroMag-entropy()
-0.0434904027503 tBodyGyro-mean()-Z
-0.0443626711323 tBodyAccJerk-correlation()-X,Z
-0.0454304423713 tBodyGyroMag-arCoeff()4
-0.0464489387411 fBodyGyro-entropy()-X
-0.0488936753149 tBodyAcc-arCoeff()-Y,4
-0.0545814665646 tBodyGyro-arCoeff()-Y,3
-0.0552462217102 tBodyAcc-entropy()-X
-0.0613778913045 angle(tBodyAccMean,gravity)
-0.0757464261295 tBodyGyroJerk-entropy()-X
-0.0829869378769 tBodyAccJerk-mean()-Z
-0.0960118554825 tBodyAccJerk-arCoeff()-X,2
-0.098224881132 tBodyGyroMag-arCoeff()3
-0.101799399436 tBodyGyroJerk-arCoeff()-Y,4
-0.10300178995 tBodyAcc-entropy()-Y
-0.106786202227 tBodyGyroJerk-entropy()-Y
-0.114114131131 tBodyAcc-arCoeff()-X,4
-0.114721730966 tBodyGyroJerkMag-entropy()
-0.121858685513 tBodyAccMag-arCoeff()3
-0.121858685513 tGravityAccMag-arCoeff()3
-0.130851232936 tBodyAcc-arCoeff()-Z,1
-0.137409984693 tGravityAcc-arCoeff()-Y,2
-0.137507565857 tBodyGyroJerk-arCoeff()-Z,4
-0.149653463758 tBodyGyroJerk-arCoeff()-Y,3
-0.155470925423 tBodyAccMag-entropy()
-0.155470925423 tGravityAccMag-entropy()
-0.159303581704 tBodyGyro-mean()-X
-0.163859158844 tBodyAcc-arCoeff()-Y,3
-0.175181130788 tBodyAccJerk-arCoeff()-Z,1
-0.178335455758 tBodyGyro-arCoeff()-Y,4
-0.186965065231 tGravityAcc-correlation()-Y,Z
-0.19922042653 tBodyGyroJerk-arCoeff()-X,4
-0.204214234218 fBodyAccJerk-meanFreq()-Z
-0.204612460278 tBodyAccJerk-arCoeff()-X,3
-0.213666571165 fBodyAcc-meanFreq()-Z
-0.215444900162 tBodyAccJerk-arCoeff()-Y,4
-0.217113781355 tGravityAcc-arCoeff()-Z,2
-0.219865381579 tBodyGyroJerk-arCoeff()-X,3
-0.22843411364 tBodyAccJerk-arCoeff()-X,4
-0.228733434969 tBodyAccJerkMag-arCoeff()1
-0.230064018175 tBodyGyro-arCoeff()-Z,3
-0.2443306184 tBodyGyroJerk-arCoeff()-Z,3
-0.253080762541 tBodyAccJerk-arCoeff()-Z,4
-0.254161363298 tBodyGyro-entropy()-Z
-0.277009317899 tBodyAcc-arCoeff()-X,3
-0.28403249056 tBodyGyro-arCoeff()-X,3
-0.289495524163 tBodyAccJerk-arCoeff()-Y,3
-0.291933023627 tBodyGyro-entropy()-Y
-0.295128885078 tBodyAcc-arCoeff()-Z,3
-0.303321162769 tBodyGyroMag-entropy()
-0.304452344728 tBodyAccJerk-arCoeff()-Z,3
-0.336804681174 tGravityAcc-arCoeff()-Z,4
-0.338993633106 fBodyBodyGyroJerkMag-meanFreq()
-0.342579661878 tBodyAcc-correlation()-Y,Z
-0.396042671642 tGravityAcc-correlation()-X,Y
-0.398682156012 tGravityAcc-arCoeff()-Y,4
-0.413061698455 tGravityAcc-arCoeff()-X,2
-0.429023069479 tBodyAcc-mean()-Y
-0.43082581152 tBodyGyroJerkMag-arCoeff()1
-0.582602194063 tGravityAcc-arCoeff()-X,4
-0.589216542858 tBodyAcc-min()-X
-0.716048380986 tBodyAccJerk-min()-X
-0.755145200486 tBodyAcc-min()-Y
-0.88019395377 tBodyAccJerk-min()-Y
-0.90709209691 angle(Z,gravityMean)
-0.930881345884 tBodyGyro-min()-Z
-1.10329333471 tBodyAccJerk-min()-Z
-1.13340084372 tBodyAcc-min()-Z
-1.18895812907 tBodyGyro-min()-X
-1.19444176141 tBodyGyroJerk-min()-X
-1.34253009705 tBodyGyro-min()-Y
-1.4254605819 angle(Y,gravityMean)
-1.42884856959 tGravityAcc-energy()-X
-1.49905735202 tBodyGyroJerk-min()-Z
-1.62648611505 tGravityAcc-min()-X
-1.62924391195 tGravityAcc-mean()-X
-1.64228675532 tGravityAcc-max()-X
-1.97948618158 tBodyGyroJerk-min()-Y
-3.48989403158 tBodyAcc-mean()-X

In addition to the KBest feature selection we can also use PCA to reduce the feature set size. First let's have a closer look at the two major components to see the natural distinction of the feature space defined by the PCA components.

In [21]:
from sklearn.decomposition import PCA

n_components = 2
pca = PCA(n_components=n_components).fit(X)
print pca.explained_variance_ratio_

# TODO: Transform the good data using the PCA fit above
reduced_data = pca.transform(X_train)
print X_train.shape
print reduced_data.shape

# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])
print reduced_data.shape

# Produce a scatter matrix for pca reduced data
pd.scatter_matrix(reduced_data, alpha = 0.8, figsize = (8,4), diagonal = 'kde');
[ 0.62227069  0.04772595]
(7352, 561)
(7352L, 2L)
(7352, 2)
In [6]:
# Produce a scatter matrix for each pair of features in the data
axes = pd.scatter_matrix(Xs, alpha = 0.3, figsize = (20,32), diagonal = 'kde')

# Reformat data.corr() for plotting
corr = Xs.corr().as_matrix()

# Plot scatter matrix with correlations
for i,j in zip(*np.triu_indices_from(axes, k=1)):
    axes[i,j].annotate("%.2f"%corr[i,j], (0.1,0.25), xycoords='axes fraction', color='red', fontsize=16)

Skewness greater than zero shows a positively skewed distribution, while lower than zero shows a negatively skewed distribution. Replacing the data with the log, square root, or inverse may help to remove the skew. However, feature values of the current dataset with selected features change between -1 and 1. Therefore, sqrt and log is not applicable. If we apply any of those transformations, most of the feature values will turn into NaN and dataset will be useless.

To avoid this we first shift data to a non-negative range, then apply the non-linear transformation, after that scale it back to -1 and 1 to be able to compare the change in the feature distribution with bare eyes. If all go right, we should be able to see less skewed feature distribution.

In addition to sqrt-ing and log-ing, I will also try boxcox-ing to reduce the skewness.

Discussion 2

SVM's classification performance for 16, 56, and 561 features are as follows:
n_features: 16 t_train: 0.802sec t_pred: 0.620sec precision: 0.84 recall: 0.81 fscore: 0.79
n_features: 56 t_train: 1.421sec t_pred: 1.247sec precision: 0.88 recall: 0.83 fscore: 0.81
n_features: 561 t_train: 8.916sec t_pred: 7.667sec precision: 0.94 recall: 0.94 fscore: 0.94

Using 16-feature reduced the training and testing time by more than 10 times while losing 10% classification performance measured as precision, recall and fscore. When compared to choosing 56 best features (top 10% of the whole feature vector), we see that 16-feature is almost as good as 56-feature in classification performance. 16-feature is twice faster than 56-feature in training and testing times.

As 16-feature is good enough for SVM, now I will find ways to improve the classification performance by scaling, normalizing and outlier-removal. First, let's have a look at how the features are distributed by using a correlation matrix.

In [7]:
import scipy.stats.stats as st
skness = st.skew(Xs)

for skew, feature_name in zip(skness , Xs_cols.tolist()):
    print "skewness: {:+.2f}\t\t feature: ".format(skew) + feature_name
skewness: +0.64		 feature: tBodyAcc-std()-X
skewness: +0.60		 feature: tBodyAcc-max()-X
skewness: -1.63		 feature: tGravityAcc-mean()-X
skewness: -1.64		 feature: tGravityAcc-max()-X
skewness: -1.63		 feature: tGravityAcc-min()-X
skewness: -1.43		 feature: tGravityAcc-energy()-X
skewness: +0.11		 feature: tBodyAccJerk-entropy()-X
skewness: +0.07		 feature: tBodyAccJerk-entropy()-Y
skewness: +0.17		 feature: tBodyAccJerk-entropy()-Z
skewness: +0.08		 feature: tBodyAccJerkMag-entropy()
skewness: +0.13		 feature: fBodyAcc-entropy()-X
skewness: +0.20		 feature: fBodyAccJerk-entropy()-X
skewness: +0.19		 feature: fBodyAccJerk-entropy()-Y
skewness: +0.27		 feature: fBodyAccJerk-entropy()-Z
skewness: +0.23		 feature: fBodyBodyAccJerkMag-entropy()
skewness: +1.42		 feature: angle(X,gravityMean)
In [8]:
from sklearn import preprocessing
from scipy.stats import boxcox

plt.rcParams['figure.figsize'] = (20.0, 80.0)
f, axarr = plt.subplots(len(Xs_cols.tolist()), 4, sharey=True)

preprocessing_names = ["noproc", "sqrted", "logged", "bxcxed"]

cnt = 0
for feature in Xs_cols.tolist():
    
    for i in range(4):
#         axarr[cnt, i].set_title( "[" + preprocessing_names[i] + "] "+ feature)
        axarr[cnt, i].set_title(feature + " histogram")
        axarr[cnt, i].set_xlabel(feature)
        axarr[cnt, i].set_ylabel("number of data points")
    
    Xs_feature = Xs[feature]
    skness = st.skew(Xs_feature)
    axarr[cnt, 0].hist(Xs_feature,facecolor='blue',alpha=0.75)
    axarr[cnt, 0].text(0.05, 0.95, 'Skewness[noproc]: {:.2f}'.format(skness), transform=axarr[cnt, 0].transAxes, 
                       fontsize=12, verticalalignment='top', color='red')
    
    Xs_feature_scaled = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(Xs_feature)

    Xs_feature_sqrted = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(np.sqrt(Xs_feature_scaled))
#     Xs_feature_sqrted = preprocessing.scale(np.sqrt(Xs_feature_scaled))
    skness = st.skew(Xs_feature_sqrted)
    axarr[cnt, 1].hist(Xs_feature_sqrted,facecolor='blue',alpha=0.75)
    axarr[cnt, 1].text(0.05, 0.95, 'Skewness[sqrted]: {:.2f}'.format(skness), transform=axarr[cnt, 1].transAxes, 
                       fontsize=12, verticalalignment='top', color='green')
    
    Xs_feature_logged = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(np.log(Xs_feature_scaled))
#     Xs_feature_logged = preprocessing.scale(np.log(Xs_feature_scaled))
    skness = st.skew(Xs_feature_logged)
    axarr[cnt, 2].hist(Xs_feature_logged,facecolor='blue',alpha=0.75)
    axarr[cnt, 2].text(0.05, 0.95, 'Skewness[logged]: {:.2f}'.format(skness), transform=axarr[cnt, 2].transAxes, 
                       fontsize=12, verticalalignment='top', color='green')
    
    Xs_feature_bxcxed = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(boxcox(Xs_feature_scaled)[0])
#     Xs_feature_bxcxed = preprocessing.scale(boxcox(Xs_feature_scaled)[0])
    skness = st.skew(Xs_feature_bxcxed)
    axarr[cnt, 3].hist(Xs_feature_bxcxed,facecolor='blue',alpha=0.75)
    axarr[cnt, 3].text(0.05, 0.95, 'Skewness[bxcxed]: {:.2f}'.format(skness), transform=axarr[cnt, 3].transAxes, fontsize=12, 
                       verticalalignment='top', color='green',  bbox=dict(facecolor='white', alpha=0.5, boxstyle='square'))    
    cnt += 1

plt.show()

Tried robust scaler but it didn't have any effect on the dataset's skewness

In [9]:
Xs_rscaled = preprocessing.RobustScaler().fit_transform(Xs)
print Xs_rscaled.shape

for feature in range(Xs_rscaled.shape[1]):
    Xs_rscaled_feature = Xs_rscaled[:,feature]
    skness = st.skew(Xs_rscaled_feature)
    print "{:2d}".format(feature) + "  {:+.2f}".format(skness)
(10299L, 16L)
 0  +0.64
 1  +0.60
 2  -1.63
 3  -1.64
 4  -1.63
 5  -1.43
 6  +0.11
 7  +0.07
 8  +0.17
 9  +0.08
10  +0.13
11  +0.20
12  +0.19
13  +0.27
14  +0.23
15  +1.42
In [10]:
def boxCoxData(data):
    data_bxcxed = []
    for feature in range(data.shape[1]):
        data_bxcxed_feature, maxlog = boxcox(data[:,feature])
        if feature == 0:
            data_bxcxed = data_bxcxed_feature
        else:
            data_bxcxed = np.column_stack([data_bxcxed, data_bxcxed_feature])
    return data_bxcxed

def ScaleData(data):
    data_scaled = []
    for feature in range(data.shape[1]):
        data_scaled_feature = preprocessing.scale(data[:,feature])
        if feature == 0:
            data_scaled = data_scaled_feature
        else:
            data_scaled = np.column_stack([data_scaled, data_scaled_feature])
    return data_scaled

def testSVMPerformance(data_train, label_train, data_test, label_test, preprocess_method):
    
    if preprocess_method != "":
        data_train = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(data_train)
        data_test = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(data_test)
    
        if preprocess_method == "logged":
            data_train = np.log(data_train)
            data_test = np.log(data_test)
        elif preprocess_method == "sqrted":
            data_train = np.sqrt(data_train)
            data_test = np.sqrt(data_test)
        elif preprocess_method == "bxcxed":
            data_train = boxCoxData(data_train)
            data_test = boxCoxData(data_test)
            
        #this resulted in a more inferior performance compared to preprocessing.scale method
#         data_train = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(data_train)
#         data_test = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(data_test)

        data_train = ScaleData(data_train)
        data_test = ScaleData(data_test)        
        
    start = time()
    clf_SVM.fit(data_train, label_train)
    end = time()
    t_train = end - start
    #NOTE: For some reason this doesn't work here
#     t_train = train(clf_SVM, data_train, label_train)
    t_test, y_pred = predict(clf_SVM, data_test)
    precision, recall, fscore, support = precision_recall_fscore_support(label_test, y_pred, average='weighted')

    printout = preprocess_method
    if preprocess_method == "":
        printout = "noproc"
    
    printout += "  t_train: {:.3f}sec".format(t_train)
    printout += "  t_pred: {:.3f}sec".format(t_test)
    printout += "  precision: {:.2f}".format(precision)
    printout += "  recall: {:.2f}".format(recall)
    printout += "  fscore: {:.2f}".format(fscore)
    print printout

X_train_processed = X_train[Xs_cols]
X_test_processed = X_test[Xs_cols]

testSVMPerformance(X_train_processed, y_train['Activity'], X_test_processed, y_test['Activity'], "")
testSVMPerformance(X_train_processed, y_train['Activity'], X_test_processed, y_test['Activity'], "scaled")
testSVMPerformance(X_train_processed, y_train['Activity'], X_test_processed, y_test['Activity'], "logged")
testSVMPerformance(X_train_processed, y_train['Activity'], X_test_processed, y_test['Activity'], "sqrted")
testSVMPerformance(X_train_processed, y_train['Activity'], X_test_processed, y_test['Activity'], "bxcxed")
noproc  t_train: 0.808sec  t_pred: 0.621sec  precision: 0.84  recall: 0.81  fscore: 0.79
scaled  t_train: 0.715sec  t_pred: 0.549sec  precision: 0.85  recall: 0.82  fscore: 0.81
logged  t_train: 0.733sec  t_pred: 0.594sec  precision: 0.84  recall: 0.82  fscore: 0.82
sqrted  t_train: 0.713sec  t_pred: 0.555sec  precision: 0.84  recall: 0.82  fscore: 0.82
bxcxed  t_train: 0.649sec  t_pred: 0.538sec  precision: 0.87  recall: 0.87  fscore: 0.87

It is time to test if there is any outlier in the boxcoxed dataset.

In [11]:
Xs_processed = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(Xs)
Xs_bxcxed = boxCoxData(Xs_processed)
Xs_bxcxed_scaled = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(Xs_bxcxed)

outliers = []
for feature in range(Xs_bxcxed_scaled.shape[1]):
    Q1 = np.percentile(Xs_bxcxed_scaled[:, feature], 25)
    Q3 = np.percentile(Xs_bxcxed_scaled[:, feature], 75)
    step = 1.5 * (Q3 - Q1)

    outlier_filter = ~((Xs_bxcxed_scaled[:, feature] >= Q1 - step) & (Xs_bxcxed_scaled[:, feature] <= Q3 + step))
    
    cnt = 0
    for outlier in outlier_filter:
        if outlier:
            outliers.append(cnt)
        cnt += 1
    
# print "number of outliers with repeating indices: " + str(len(outliers))

id2cnt = {}
for outlier in outliers:
    if not outlier in id2cnt:
        id2cnt[outlier] = 1
    else:
        id2cnt[outlier] += 1
    
sorted_id2cnt = sorted(id2cnt.items(), key=operator.itemgetter(1), reverse=True)
cnt2nindices = {}
for key, value in sorted_id2cnt:
    #only remove the outliers that are repeated more than once
    if value <=1:
        break
    if not value in cnt2nindices:
        cnt2nindices[value] = 1
    else:
        cnt2nindices[value] += 1

for key, value in cnt2nindices.iteritems():
    print "{:2d} features share {:4d} potential outliers".format(key, value)
 2 features share   23 potential outliers
 3 features share 1953 potential outliers

Let's try to remove those 1953 potential outliers and test the performance of the SVM again. Although, this seems like losing too much data, I just want to see how this may effect the learning performance.

In [12]:
removed_outliers = []
for key, value in sorted_id2cnt:
    if value == 3:
        removed_outliers.append(key)

y_labels = y['Activity']

results_precision = []
results_recall = []
results_fscore = []
kfold = cross_validation.KFold(Xs.shape[0], n_folds=10, shuffle=False, random_state=42)
for train, test in kfold:
    clf_SVM.fit(Xs.iloc[train], y_labels.iloc[train])
#         t_train = train(clf_SVM, Xs_subset.iloc[train], y_subset.iloc[train])
    t_test, y_pred = predict(clf_SVM, Xs.iloc[test])
    precision, recall, fscore, support = precision_recall_fscore_support(y_labels.iloc[test], y_pred, 
                                                                         average='weighted')
    results_precision.append(precision)
    results_recall.append(recall)
    results_fscore.append(fscore)

printout = "subsetsize: {:5d}".format(Xs.shape[0])
#     printout += "  t_train: {:.3f}sec".format(t_train)
#     printout += "  t_pred: {:.3f}sec".format(t_test)
printout += "  precision: {:.2f}".format(np.mean(results_precision))
printout += "  recall: {:.2f}".format(np.mean(results_recall))
printout += "  fscore: {:.2f}".format(np.mean(results_fscore))
print printout

Xs_filtered = Xs.drop(removed_outliers)
y_filtered = y.drop(removed_outliers)
y_filtered_labels = y_filtered['Activity'].to_frame()

Xs_filtered_proc = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(Xs_filtered)
Xs_filtered_proc = boxCoxData(Xs_filtered_proc)
Xs_filtered_proc = ScaleData(Xs_filtered_proc)
Xs_filtered_proc = Xs_filtered_proc

results_precision = []
results_recall = []
results_fscore = []
kfold = cross_validation.KFold(Xs_filtered_proc.shape[0], n_folds=10, shuffle=False, random_state=42)  
for train, test in kfold:
    clf_SVM.fit(Xs_filtered_proc[train], y_filtered_labels.iloc[train])
    t_test, y_pred = predict(clf_SVM, Xs_filtered_proc[test])
    precision, recall, fscore, support = precision_recall_fscore_support(y_filtered_labels.iloc[test], y_pred, 
                                                                         average='weighted')
    results_precision.append(precision)
    results_recall.append(recall)
    results_fscore.append(fscore)

print "**************"
printout = "subsetsize: {:5d}".format(Xs_filtered_proc.shape[0])
#     printout += "  t_train: {:.3f}sec".format(t_train)
#     printout += "  t_pred: {:.3f}sec".format(t_test)
printout += "  precision: {:.2f}".format(np.mean(results_precision))
printout += "  recall: {:.2f}".format(np.mean(results_recall))
printout += "  fscore: {:.2f}".format(np.mean(results_fscore))
print printout


results_precision = []
results_recall = []
results_fscore = []
kfold = cross_validation.KFold(Xs_filtered.shape[0], n_folds=10, shuffle=False, random_state=42)  
for train, test in kfold:
    clf_SVM.fit(Xs_filtered.iloc[train], y_filtered_labels.iloc[train])
#         t_train = train(clf_SVM, Xs_subset.iloc[train], y_subset.iloc[train])
    t_test, y_pred = predict(clf_SVM, Xs_filtered.iloc[test])
    precision, recall, fscore, support = precision_recall_fscore_support(y_filtered_labels.iloc[test], y_pred, 
                                                                         average='weighted')
    results_precision.append(precision)
    results_recall.append(recall)
    results_fscore.append(fscore)

printout = "subsetsize: {:5d}".format(Xs_filtered.shape[0])
#     printout += "  t_train: {:.3f}sec".format(t_train)
#     printout += "  t_pred: {:.3f}sec".format(t_test)
printout += "  precision: {:.2f}".format(np.mean(results_precision))
printout += "  recall: {:.2f}".format(np.mean(results_recall))
printout += "  fscore: {:.2f}".format(np.mean(results_fscore))
print printout

from random import sample
n_multiplier = Xs.shape[0]/500

for i in range(1, n_multiplier+1):
    subsetsize = i*500
    random_index = sample(range(0, Xs.shape[0]), subsetsize)
    
    Xs_subset = Xs.iloc[random_index]
    y_subset = y_labels.iloc[random_index].to_frame()

    results_precision = []
    results_recall = []
    results_fscore = []
    kfold = cross_validation.KFold(Xs_subset.shape[0], n_folds=10, shuffle=False, random_state=42)
    for train, test in kfold:
        clf_SVM.fit(Xs_subset.iloc[train], y_subset.iloc[train])
#         t_train = train(clf_SVM, Xs_subset.iloc[train], y_subset.iloc[train])
        t_test, y_pred = predict(clf_SVM, Xs_subset.iloc[test])
        precision, recall, fscore, support = precision_recall_fscore_support(y_subset.iloc[test], y_pred, 
                                                                             average='weighted')
        results_precision.append(precision)
        results_recall.append(recall)
        results_fscore.append(fscore)

    printout = "subsetsize: {:5d}".format(subsetsize)
#     printout += "  t_train: {:.3f}sec".format(t_train)
#     printout += "  t_pred: {:.3f}sec".format(t_test)
    printout += "  precision: {:.2f}".format(np.mean(results_precision))
    printout += "  recall: {:.2f}".format(np.mean(results_recall))
    printout += "  fscore: {:.2f}".format(np.mean(results_fscore))
    print printout
subsetsize: 10299  precision: 0.85  recall: 0.82  fscore: 0.81
**************
subsetsize:  8346  precision: 0.88  recall: 0.87  fscore: 0.87
subsetsize:  8346  precision: 0.82  recall: 0.78  fscore: 0.76
subsetsize:   500  precision: 0.81  recall: 0.77  fscore: 0.74
subsetsize:  1000  precision: 0.85  recall: 0.79  fscore: 0.76
subsetsize:  1500  precision: 0.82  recall: 0.80  fscore: 0.79
subsetsize:  2000  precision: 0.81  recall: 0.79  fscore: 0.79
subsetsize:  2500  precision: 0.85  recall: 0.80  fscore: 0.77
subsetsize:  3000  precision: 0.84  recall: 0.81  fscore: 0.80
subsetsize:  3500  precision: 0.86  recall: 0.82  fscore: 0.79
subsetsize:  4000  precision: 0.85  recall: 0.82  fscore: 0.81
subsetsize:  4500  precision: 0.85  recall: 0.82  fscore: 0.81
subsetsize:  5000  precision: 0.85  recall: 0.83  fscore: 0.81
subsetsize:  5500  precision: 0.85  recall: 0.82  fscore: 0.80
subsetsize:  6000  precision: 0.86  recall: 0.83  fscore: 0.81
subsetsize:  6500  precision: 0.85  recall: 0.82  fscore: 0.81
subsetsize:  7000  precision: 0.86  recall: 0.82  fscore: 0.81
subsetsize:  7500  precision: 0.85  recall: 0.83  fscore: 0.81
subsetsize:  8000  precision: 0.86  recall: 0.82  fscore: 0.81
subsetsize:  8500  precision: 0.85  recall: 0.83  fscore: 0.82
subsetsize:  9000  precision: 0.86  recall: 0.83  fscore: 0.82
subsetsize:  9500  precision: 0.86  recall: 0.83  fscore: 0.81
subsetsize: 10000  precision: 0.86  recall: 0.83  fscore: 0.82

This shows that feature preprocessing and outlier removal are tied together. In other words, detected outliers are special to the space they are transformed to through preprocessing methods. Therefore, outliers in transformed space may not be outliers in the original space. Following results show that removing the outliers is only good if the learning is done on the space where those features transformed to.

subsetsize: 8346 precision: 0.87 recall: 0.86 fscore: 0.86 (features are preprocessed)
subsetsize: 8346 precision: 0.80 recall: 0.76 fscore: 0.74 (features are kept as the way they are)

In [17]:
from sklearn.decomposition import PCA

num_components = range(2, X.shape[1]/10, 2) + range(X.shape[1]/10, X.shape[1]/3, X.shape[1]/40)

for n_components in num_components:
    pca = PCA(n_components=n_components).fit(X)
#     print pca.explained_variance_ratio_
    printout = "n_components: {:d}".format(n_components)
    X_pcaed = pca.transform(X)
    
    results_precision = []
    results_recall = []
    results_fscore = []    
    kfold = cross_validation.KFold(X_pcaed.shape[0], n_folds=10, shuffle=False, random_state=42)
    for train, test in kfold:
        clf_SVM.fit(X_pcaed[train], y_labels.iloc[train])
        t_test, y_pred = predict(clf_SVM, X_pcaed[test])
        precision, recall, fscore, support = precision_recall_fscore_support(y_labels.iloc[test], y_pred, 
                                                                             average='weighted')
        results_precision.append(precision)
        results_recall.append(recall)
        results_fscore.append(fscore)

    printout += "  precision: {:.2f}".format(np.mean(results_precision))
    printout += "  recall: {:.2f}".format(np.mean(results_recall))
    printout += "  fscore: {:.2f}".format(np.mean(results_fscore))
    print printout
n_components: 2  precision: 0.64  recall: 0.63  fscore: 0.61
n_components: 4  precision: 0.80  recall: 0.78  fscore: 0.77
n_components: 6  precision: 0.84  recall: 0.83  fscore: 0.82
n_components: 8  precision: 0.86  recall: 0.85  fscore: 0.85
n_components: 10  precision: 0.88  recall: 0.88  fscore: 0.88
n_components: 12  precision: 0.89  recall: 0.88  fscore: 0.88
n_components: 14  precision: 0.90  recall: 0.89  fscore: 0.89
n_components: 16  precision: 0.91  recall: 0.90  fscore: 0.90
n_components: 18  precision: 0.91  recall: 0.90  fscore: 0.90
n_components: 20  precision: 0.91  recall: 0.90  fscore: 0.90
n_components: 22  precision: 0.92  recall: 0.91  fscore: 0.91
n_components: 24  precision: 0.92  recall: 0.91  fscore: 0.91
n_components: 26  precision: 0.92  recall: 0.91  fscore: 0.91
n_components: 28  precision: 0.92  recall: 0.92  fscore: 0.92
n_components: 30  precision: 0.93  recall: 0.92  fscore: 0.92
n_components: 32  precision: 0.93  recall: 0.92  fscore: 0.92
n_components: 34  precision: 0.93  recall: 0.92  fscore: 0.92
n_components: 36  precision: 0.93  recall: 0.92  fscore: 0.92
n_components: 38  precision: 0.93  recall: 0.93  fscore: 0.93
n_components: 40  precision: 0.94  recall: 0.93  fscore: 0.93
n_components: 42  precision: 0.93  recall: 0.93  fscore: 0.93
n_components: 44  precision: 0.93  recall: 0.93  fscore: 0.93
n_components: 46  precision: 0.94  recall: 0.93  fscore: 0.93
n_components: 48  precision: 0.93  recall: 0.93  fscore: 0.93
n_components: 50  precision: 0.93  recall: 0.93  fscore: 0.93
n_components: 52  precision: 0.94  recall: 0.93  fscore: 0.93
n_components: 54  precision: 0.94  recall: 0.93  fscore: 0.93
n_components: 56  precision: 0.94  recall: 0.94  fscore: 0.94
n_components: 70  precision: 0.94  recall: 0.94  fscore: 0.94
n_components: 84  precision: 0.95  recall: 0.94  fscore: 0.94
n_components: 98  precision: 0.95  recall: 0.94  fscore: 0.94
n_components: 112  precision: 0.95  recall: 0.94  fscore: 0.94
n_components: 126  precision: 0.95  recall: 0.95  fscore: 0.95
n_components: 140  precision: 0.95  recall: 0.95  fscore: 0.95
n_components: 154  precision: 0.95  recall: 0.95  fscore: 0.95
n_components: 168  precision: 0.95  recall: 0.95  fscore: 0.95
n_components: 182  precision: 0.95  recall: 0.95  fscore: 0.95
In [20]:
from sklearn.feature_selection import SelectKBest
from scipy.stats import boxcox
from sklearn import preprocessing
from sklearn.decomposition import PCA

import warnings
warnings.filterwarnings('ignore')

def boxCoxData(data):
    data_bxcxed = []
    for feature in range(data.shape[1]):
        data_bxcxed_feature, maxlog = boxcox(data[:,feature])
        if feature == 0:
            data_bxcxed = data_bxcxed_feature
        else:
            data_bxcxed = np.column_stack([data_bxcxed, data_bxcxed_feature])
    return data_bxcxed

def ScaleData(data):
    data_scaled = []
    for feature in range(data.shape[1]):
        data_scaled_feature = preprocessing.scale(data[:,feature])
        if feature == 0:
            data_scaled = data_scaled_feature
        else:
            data_scaled = np.column_stack([data_scaled, data_scaled_feature])
    return data_scaled

def predict(clf, features):
    start = time()
    pred = clf.predict(features)
    end = time()
    return end - start, pred

kbest_param_vals = [5, 10, 15, 20, 30, 50, 100, 200, X.shape[1]]
pca_n_components = [2, 5, 10, 15, 20, 30, 40, 50, 100, 200]

for kbest in kbest_param_vals:
    start = time()
    #choose kbest feature dimensions
    f_selector = SelectKBest(k=kbest)
    X_slctd = f_selector.fit(X, y['Activity']).transform(X)
    f_selected_indices = f_selector.get_support(indices=False)
    X_slctd_cols = X.columns[f_selected_indices]
    
    #transform these features to another space where they are less skewed
    X_slctd_tformed = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(X_slctd)
    X_slctd_tformed = boxCoxData(X_slctd_tformed)
    X_slctd_tformed = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(X_slctd_tformed)
    X_slctd_tformed = pd.DataFrame(data=X_slctd_tformed, index=range(X_slctd_tformed.shape[0]), columns=X_slctd_cols)
    end = time()
    
    for pca_n in pca_n_components:
        column_names = []
        for i in range(pca_n):
            column_names.append("component{:2d}".format(i))
            
        start_pca = time()
        pca = PCA(n_components=pca_n).fit(X)
        X_pcaed = pca.transform(X)
        X_pcaed = pd.DataFrame(data=X_pcaed, index=range(X_pcaed.shape[0]), columns=column_names)
        
        X_combined = pd.concat([X_slctd_tformed, X_pcaed], axis=1)
        end_pca = time()
        t_proc = (end - start) + (end_pca - start_pca)
        results_precision = []
        results_recall = []
        results_fscore = []
        kfold = cross_validation.KFold(X_combined.shape[0], n_folds=10, shuffle=False, random_state=42)
        t_trains = []
        t_tests = []
        for train, test in kfold:
            t_train_s = time()
            clf_SVM.fit(X_combined.iloc[train], y_labels.iloc[train])
            t_trains.append( time() - t_train_s )
            t_test, y_pred = predict(clf_SVM, X_combined.iloc[test])
            t_tests.append(t_test)
            precision, recall, fscore, support = precision_recall_fscore_support(y_labels.iloc[test], y_pred, 
                                                                                 average='weighted')
            results_precision.append(precision)
            results_recall.append(recall)
            results_fscore.append(fscore)

        printout = "(kbest{:3d})(pca_n{:3d})".format(kbest, pca_n)
        printout += "  precision: {:.2f}".format(np.mean(results_precision))
        printout += "  recall: {:.2f}".format(np.mean(results_recall))
        printout += "  fscore: {:.2f}\t".format(np.mean(results_fscore))
        printout += "  t_proc: {:.2f}  t_train: {:.2f}  t_test: {:.2f}".format(t_proc, np.mean(t_trains), np.mean(t_tests))
        print printout
(kbest  5)(pca_n  2)  precision: 0.79  recall: 0.76  fscore: 0.75	  t_proc: 1.50  t_train: 1.16  t_test: 0.22
(kbest  5)(pca_n  5)  precision: 0.83  recall: 0.82  fscore: 0.81	  t_proc: 1.44  t_train: 1.03  t_test: 0.19
(kbest  5)(pca_n 10)  precision: 0.89  recall: 0.89  fscore: 0.89	  t_proc: 1.44  t_train: 0.81  t_test: 0.15
(kbest  5)(pca_n 15)  precision: 0.90  recall: 0.89  fscore: 0.89	  t_proc: 1.49  t_train: 0.86  t_test: 0.16
(kbest  5)(pca_n 20)  precision: 0.91  recall: 0.90  fscore: 0.90	  t_proc: 1.46  t_train: 0.93  t_test: 0.18
(kbest  5)(pca_n 30)  precision: 0.93  recall: 0.92  fscore: 0.92	  t_proc: 1.41  t_train: 0.98  t_test: 0.20
(kbest  5)(pca_n 40)  precision: 0.93  recall: 0.93  fscore: 0.93	  t_proc: 1.56  t_train: 1.07  t_test: 0.23
(kbest  5)(pca_n 50)  precision: 0.94  recall: 0.93  fscore: 0.93	  t_proc: 1.58  t_train: 1.16  t_test: 0.25
(kbest  5)(pca_n100)  precision: 0.95  recall: 0.94  fscore: 0.94	  t_proc: 1.50  t_train: 1.81  t_test: 0.42
(kbest  5)(pca_n200)  precision: 0.95  recall: 0.95  fscore: 0.95	  t_proc: 1.54  t_train: 3.65  t_test: 0.90
(kbest 10)(pca_n  2)  precision: 0.81  recall: 0.78  fscore: 0.77	  t_proc: 1.49  t_train: 1.19  t_test: 0.25
(kbest 10)(pca_n  5)  precision: 0.84  recall: 0.83  fscore: 0.83	  t_proc: 1.44  t_train: 1.07  t_test: 0.21
(kbest 10)(pca_n 10)  precision: 0.89  recall: 0.89  fscore: 0.89	  t_proc: 1.47  t_train: 0.85  t_test: 0.17
(kbest 10)(pca_n 15)  precision: 0.90  recall: 0.89  fscore: 0.89	  t_proc: 1.43  t_train: 0.91  t_test: 0.18
(kbest 10)(pca_n 20)  precision: 0.91  recall: 0.90  fscore: 0.90	  t_proc: 1.45  t_train: 0.97  t_test: 0.20
(kbest 10)(pca_n 30)  precision: 0.93  recall: 0.92  fscore: 0.92	  t_proc: 1.50  t_train: 1.01  t_test: 0.22
(kbest 10)(pca_n 40)  precision: 0.93  recall: 0.93  fscore: 0.93	  t_proc: 1.65  t_train: 1.12  t_test: 0.24
(kbest 10)(pca_n 50)  precision: 0.94  recall: 0.93  fscore: 0.93	  t_proc: 1.51  t_train: 1.23  t_test: 0.27
(kbest 10)(pca_n100)  precision: 0.95  recall: 0.94  fscore: 0.94	  t_proc: 1.52  t_train: 1.93  t_test: 0.45
(kbest 10)(pca_n200)  precision: 0.95  recall: 0.95  fscore: 0.95	  t_proc: 1.58  t_train: 3.72  t_test: 0.92
(kbest 15)(pca_n  2)  precision: 0.85  recall: 0.84  fscore: 0.84	  t_proc: 1.58  t_train: 1.21  t_test: 0.25
(kbest 15)(pca_n  5)  precision: 0.87  recall: 0.87  fscore: 0.86	  t_proc: 1.61  t_train: 1.05  t_test: 0.22
(kbest 15)(pca_n 10)  precision: 0.90  recall: 0.90  fscore: 0.90	  t_proc: 1.70  t_train: 0.89  t_test: 0.18
(kbest 15)(pca_n 15)  precision: 0.91  recall: 0.90  fscore: 0.90	  t_proc: 1.62  t_train: 0.98  t_test: 0.21
(kbest 15)(pca_n 20)  precision: 0.92  recall: 0.91  fscore: 0.91	  t_proc: 1.79  t_train: 1.08  t_test: 0.23
(kbest 15)(pca_n 30)  precision: 0.93  recall: 0.92  fscore: 0.92	  t_proc: 1.70  t_train: 1.06  t_test: 0.23
(kbest 15)(pca_n 40)  precision: 0.93  recall: 0.93  fscore: 0.93	  t_proc: 1.58  t_train: 1.17  t_test: 0.26
(kbest 15)(pca_n 50)  precision: 0.93  recall: 0.93  fscore: 0.93	  t_proc: 1.60  t_train: 1.28  t_test: 0.29
(kbest 15)(pca_n100)  precision: 0.95  recall: 0.94  fscore: 0.94	  t_proc: 1.56  t_train: 1.97  t_test: 0.47
(kbest 15)(pca_n200)  precision: 0.95  recall: 0.95  fscore: 0.95	  t_proc: 1.69  t_train: 3.88  t_test: 0.94
(kbest 20)(pca_n  2)  precision: 0.86  recall: 0.85  fscore: 0.85	  t_proc: 1.64  t_train: 1.26  t_test: 0.28
(kbest 20)(pca_n  5)  precision: 0.87  recall: 0.87  fscore: 0.86	  t_proc: 1.66  t_train: 1.11  t_test: 0.24
(kbest 20)(pca_n 10)  precision: 0.90  recall: 0.90  fscore: 0.90	  t_proc: 1.69  t_train: 0.98  t_test: 0.21
(kbest 20)(pca_n 15)  precision: 0.91  recall: 0.90  fscore: 0.90	  t_proc: 1.78  t_train: 1.02  t_test: 0.23
(kbest 20)(pca_n 20)  precision: 0.91  recall: 0.91  fscore: 0.91	  t_proc: 1.75  t_train: 1.09  t_test: 0.24
(kbest 20)(pca_n 30)  precision: 0.93  recall: 0.92  fscore: 0.92	  t_proc: 1.87  t_train: 1.15  t_test: 0.26
(kbest 20)(pca_n 40)  precision: 0.93  recall: 0.93  fscore: 0.93	  t_proc: 1.64  t_train: 1.23  t_test: 0.28
(kbest 20)(pca_n 50)  precision: 0.93  recall: 0.93  fscore: 0.93	  t_proc: 1.65  t_train: 1.34  t_test: 0.31
(kbest 20)(pca_n100)  precision: 0.95  recall: 0.94  fscore: 0.94	  t_proc: 1.75  t_train: 2.03  t_test: 0.48
(kbest 20)(pca_n200)  precision: 0.95  recall: 0.95  fscore: 0.94	  t_proc: 1.67  t_train: 3.85  t_test: 0.96
(kbest 30)(pca_n  2)  precision: 0.86  recall: 0.85  fscore: 0.85	  t_proc: 1.79  t_train: 1.48  t_test: 0.34
(kbest 30)(pca_n  5)  precision: 0.88  recall: 0.87  fscore: 0.87	  t_proc: 1.78  t_train: 1.31  t_test: 0.30
(kbest 30)(pca_n 10)  precision: 0.90  recall: 0.90  fscore: 0.89	  t_proc: 1.85  t_train: 1.09  t_test: 0.25
(kbest 30)(pca_n 15)  precision: 0.91  recall: 0.90  fscore: 0.90	  t_proc: 1.83  t_train: 1.14  t_test: 0.26
(kbest 30)(pca_n 20)  precision: 0.91  recall: 0.90  fscore: 0.90	  t_proc: 1.78  t_train: 1.21  t_test: 0.28
(kbest 30)(pca_n 30)  precision: 0.93  recall: 0.92  fscore: 0.92	  t_proc: 1.80  t_train: 1.25  t_test: 0.29
(kbest 30)(pca_n 40)  precision: 0.93  recall: 0.93  fscore: 0.92	  t_proc: 1.80  t_train: 1.36  t_test: 0.32
(kbest 30)(pca_n 50)  precision: 0.93  recall: 0.93  fscore: 0.93	  t_proc: 1.76  t_train: 1.50  t_test: 0.35
(kbest 30)(pca_n100)  precision: 0.95  recall: 0.94  fscore: 0.94	  t_proc: 1.88  t_train: 2.18  t_test: 0.53
(kbest 30)(pca_n200)  precision: 0.95  recall: 0.94  fscore: 0.94	  t_proc: 1.91  t_train: 4.06  t_test: 1.01
(kbest 50)(pca_n  2)  precision: 0.86  recall: 0.85  fscore: 0.84	  t_proc: 2.18  t_train: 1.89  t_test: 0.45
(kbest 50)(pca_n  5)  precision: 0.89  recall: 0.88  fscore: 0.88	  t_proc: 2.24  t_train: 1.66  t_test: 0.39
(kbest 50)(pca_n 10)  precision: 0.90  recall: 0.90  fscore: 0.90	  t_proc: 2.15  t_train: 1.40  t_test: 0.34
(kbest 50)(pca_n 15)  precision: 0.91  recall: 0.90  fscore: 0.90	  t_proc: 2.19  t_train: 1.45  t_test: 0.35
(kbest 50)(pca_n 20)  precision: 0.91  recall: 0.90  fscore: 0.90	  t_proc: 2.19  t_train: 1.51  t_test: 0.36
(kbest 50)(pca_n 30)  precision: 0.93  recall: 0.92  fscore: 0.92	  t_proc: 2.19  t_train: 1.53  t_test: 0.37
(kbest 50)(pca_n 40)  precision: 0.93  recall: 0.92  fscore: 0.92	  t_proc: 2.14  t_train: 1.67  t_test: 0.40
(kbest 50)(pca_n 50)  precision: 0.93  recall: 0.93  fscore: 0.93	  t_proc: 2.16  t_train: 1.78  t_test: 0.43
(kbest 50)(pca_n100)  precision: 0.95  recall: 0.94  fscore: 0.94	  t_proc: 2.14  t_train: 2.50  t_test: 0.61
(kbest 50)(pca_n200)  precision: 0.95  recall: 0.94  fscore: 0.94	  t_proc: 2.26  t_train: 4.43  t_test: 1.11
(kbest100)(pca_n  2)  precision: 0.91  recall: 0.90  fscore: 0.90	  t_proc: 3.12  t_train: 2.90  t_test: 0.73
(kbest100)(pca_n  5)  precision: 0.92  recall: 0.91  fscore: 0.91	  t_proc: 3.15  t_train: 2.52  t_test: 0.63
(kbest100)(pca_n 10)  precision: 0.92  recall: 0.91  fscore: 0.91	  t_proc: 3.11  t_train: 2.19  t_test: 0.55
(kbest100)(pca_n 15)  precision: 0.92  recall: 0.91  fscore: 0.91	  t_proc: 3.08  t_train: 2.24  t_test: 0.56
(kbest100)(pca_n 20)  precision: 0.92  recall: 0.92  fscore: 0.92	  t_proc: 3.16  t_train: 2.30  t_test: 0.57
(kbest100)(pca_n 30)  precision: 0.93  recall: 0.93  fscore: 0.93	  t_proc: 3.13  t_train: 2.30  t_test: 0.57
(kbest100)(pca_n 40)  precision: 0.94  recall: 0.93  fscore: 0.93	  t_proc: 3.14  t_train: 2.42  t_test: 0.60
(kbest100)(pca_n 50)  precision: 0.94  recall: 0.93  fscore: 0.93	  t_proc: 3.13  t_train: 2.56  t_test: 0.63
(kbest100)(pca_n100)  precision: 0.95  recall: 0.95  fscore: 0.94	  t_proc: 3.14  t_train: 3.34  t_test: 0.83
(kbest100)(pca_n200)  precision: 0.95  recall: 0.95  fscore: 0.95	  t_proc: 3.24  t_train: 5.36  t_test: 1.37
(kbest200)(pca_n  2)  precision: 0.94  recall: 0.93  fscore: 0.93	  t_proc: 5.48  t_train: 4.92  t_test: 1.26
(kbest200)(pca_n  5)  precision: 0.93  recall: 0.92  fscore: 0.92	  t_proc: 5.48  t_train: 4.35  t_test: 1.15
(kbest200)(pca_n 10)  precision: 0.93  recall: 0.92  fscore: 0.92	  t_proc: 5.50  t_train: 3.95  t_test: 1.02
(kbest200)(pca_n 15)  precision: 0.93  recall: 0.92  fscore: 0.92	  t_proc: 5.49  t_train: 4.00  t_test: 1.02
(kbest200)(pca_n 20)  precision: 0.93  recall: 0.92  fscore: 0.92	  t_proc: 5.50  t_train: 4.07  t_test: 1.04
(kbest200)(pca_n 30)  precision: 0.94  recall: 0.93  fscore: 0.93	  t_proc: 5.47  t_train: 4.01  t_test: 1.03
(kbest200)(pca_n 40)  precision: 0.94  recall: 0.93  fscore: 0.93	  t_proc: 5.50  t_train: 4.14  t_test: 1.06
(kbest200)(pca_n 50)  precision: 0.94  recall: 0.94  fscore: 0.93	  t_proc: 5.52  t_train: 4.24  t_test: 1.08
(kbest200)(pca_n100)  precision: 0.95  recall: 0.94  fscore: 0.94	  t_proc: 5.54  t_train: 5.04  t_test: 1.29
(kbest200)(pca_n200)  precision: 0.95  recall: 0.95  fscore: 0.94	  t_proc: 5.56  t_train: 7.13  t_test: 1.85
(kbest561)(pca_n  2)  precision: 0.94  recall: 0.94  fscore: 0.94	  t_proc: 17.95  t_train: 10.98  t_test: 2.77
(kbest561)(pca_n  5)  precision: 0.94  recall: 0.94  fscore: 0.94	  t_proc: 17.98  t_train: 10.31  t_test: 2.64
(kbest561)(pca_n 10)  precision: 0.94  recall: 0.94  fscore: 0.94	  t_proc: 17.98  t_train: 10.26  t_test: 2.59
(kbest561)(pca_n 15)  precision: 0.94  recall: 0.93  fscore: 0.93	  t_proc: 17.97  t_train: 10.15  t_test: 2.56
(kbest561)(pca_n 20)  precision: 0.94  recall: 0.94  fscore: 0.93	  t_proc: 17.96  t_train: 10.31  t_test: 2.60
(kbest561)(pca_n 30)  precision: 0.94  recall: 0.94  fscore: 0.94	  t_proc: 18.13  t_train: 10.26  t_test: 2.60
(kbest561)(pca_n 40)  precision: 0.94  recall: 0.94  fscore: 0.94	  t_proc: 17.98  t_train: 10.30  t_test: 2.59
(kbest561)(pca_n 50)  precision: 0.94  recall: 0.94  fscore: 0.94	  t_proc: 17.99  t_train: 10.47  t_test: 2.63
(kbest561)(pca_n100)  precision: 0.95  recall: 0.94  fscore: 0.94	  t_proc: 17.97  t_train: 11.40  t_test: 2.86
(kbest561)(pca_n200)  precision: 0.95  recall: 0.94  fscore: 0.94	  t_proc: 17.96  t_train: 13.67  t_test: 3.46

It is now time to optimize training parameters of the models while incorporating best set of features. My hypothesis is that this will yield in the best classification performance.[NOTE: TO BE CONTINUED]

In [22]:
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    return r2_score(y_true, y_predict)

from sklearn.metrics import make_scorer
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import ShuffleSplit

def fit_model(X, y):

    # Create cross-validation sets from the training data
    cv_sets = ShuffleSplit(X.shape[0], n_iter = 10, test_size = 0.20, random_state = 0)
    params = {'max_depth': range(1,20)}
    
    # TODO: Transform 'performance_metric' into a scoring function using 'make_scorer' 
    scoring_fnc = make_scorer(performance_metric)

    regressor = DecisionTreeRegressor(max_depth = params['max_depth'], random_state=42)
    grid = GridSearchCV(regressor, param_grid=params, scoring=scoring_fnc)
    grid = grid.fit(X, y)
    return grid.best_estimator_

# clf = fit_model(X,y)
clf = fit_model(X_train,y_train)

print clf.score(X_train, y_train)
print clf.score(X_test, y_test)
0.190959174602
-7.59452573956